1 Intro: What is datamining? Data are generated in large amount. E.g. transactions, telephone calls. Data is collected because believed to be a potential.

Slides:



Advertisements
Similar presentations
Chapter 10: Designing Databases
Advertisements

ARCHITECTURES FOR ARTIFICIAL INTELLIGENCE SYSTEMS
Database Systems: Design, Implementation, and Management Tenth Edition
Mapping Nominal Values to Numbers for Effective Visualization Presented by Matthew O. Ward Geraldine Rosario, Elke Rundensteiner, David Brown, Matthew.
Polaris: A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases Presented by Darren Gates for ICS 280.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Classifier Decision Tree A decision tree classifies data by predicting the label for each record. The first element of the tree is the root node, representing.
Your Interactive Guide to the Digital World Discovering Computers 2012 Chapter 10 Managing a Database.
Advanced Topics COMP163: Database Management Systems University of the Pacific December 9, 2008.
1 Chapter 4 The Fundamentals of VBA, Macros, and Command Bars.
Living in a Digital World Discovering Computers 2010.
Evaluation of MineSet 3.0 By Rajesh Rathinasabapathi S Peer Mohamed Raja Guided By Dr. Li Yang.
COMP 578 Data Warehousing And OLAP Technology Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University.
Chapter 14 The Second Component: The Database.
Data Mining – Intro.
Academic Year 2014 Spring.
WPI Center for Research in Exploratory Data and Information Analysis From Data to Knowledge: Exploring Industrial, Scientific, and Commercial Databases.
2 1 Chapter 2 Data Model Database Systems: Design, Implementation, and Management, Sixth Edition, Rob and Coronel.
SharePoint 2010 Business Intelligence Module 6: Analysis Services.
The Tutorial of Principal Component Analysis, Hierarchical Clustering, and Multidimensional Scaling Wenshan Wang.
Discovering Computers Fundamentals, 2012 Edition Your Interactive Guide to the Digital World.
Data Management Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition.
Ihr Logo Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang.
Concept demo System dashboard. Overview Dashboard use case General implementation ideas Use of MULE integration platform Collection Aggregation/Factorization.
Avalanche Internet Data Management System. Presentation plan 1. The problem to be solved 2. Description of the software needed 3. The solution 4. Avalanche.
Objectives Overview Define the term, database, and explain how a database interacts with data and information Define the term, data integrity, and describe.
Perception-Based Classification (PBC) System Salvador Ledezma April 25, 2002.
© 2010 Pearson Addison-Wesley. All rights reserved. Addison Wesley is an imprint of Designing the User Interface: Strategies for Effective Human-Computer.
OnLine Analytical Processing (OLAP)
© 2005 Prentice Hall, Decision Support Systems and Intelligent Systems, 7th Edition, Turban, Aronson, and Liang 5-1 Chapter 5 Business Intelligence: Data.
Time Series Data Analysis - I Yaji Sripada. Dept. of Computing Science, University of Aberdeen2 In this lecture you learn What are Time Series? How to.
Visualization Blaz Zupan Faculty of Computer & Info Science University of Ljubljana, Slovenia.
Discovering Computers Fundamentals Fifth Edition Chapter 9 Database Management.
Professor Michael J. Losacco CIS 1110 – Using Computers Database Management Chapter 9.
Objectives Overview Define the term, database, and explain how a database interacts with data and information Describe the qualities of valuable information.
Introduction to Parallel Rendering Jian Huang, CS 594, Spring 2002.
An Internet of Things: People, Processes, and Products in the Spotfire Cloud Library Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist.
Ihr Logo Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang.
5 - 1 Copyright © 2006, The McGraw-Hill Companies, Inc. All rights reserved.
1 STAT 500 – Statistics for Managers STAT 500 Statistics for Managers.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
INFO1408 Database Design Concepts Week 15: Introduction to Database Management Systems.
Operation Data Analysis Hints and Guidelines EGN 5621 Enterprise Systems Collaboration Summer B, 2014.
Chapter 10 Designing the Files and Databases. SAD/CHAPTER 102 Learning Objectives Discuss the conversion from a logical data model to a physical database.
1 Technology in Action Chapter 11 Behind the Scenes: Databases and Information Systems Copyright © 2010 Pearson Education, Inc. Publishing as Prentice.
CS3041 – Final week Today: Searching and Visualization Friday: Software tools –Study guide distributed (in class only) Monday: Social Imps –Study guide.
VizDB A tool to support Exploration of large databases By using Human Visual System To analyze mid-size to large data.
Pad++: A Zooming Graphical Interface for Exploring Alternate Interface Physics Presented By: Daniel Loewus-Deitch.
Today’s Goals Answer questions about homework and lecture 2 Understand what a query is Understand how to create simple queries using Microsoft Access 2007.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Distributed Data Analysis & Dissemination System (D-DADS ) Special Interest Group on Data Integration June 2000.
Importance of user interface design – Useful, useable, used Three golden rules – Place the user in control – Reduce the user’s memory load – Make the.
Artificial Intelligence: Research and Collaborative Possibilities a presentation by: Dr. Ernest L. McDuffie, Assistant Professor Department of Computer.
DataJewel 1 : Tightly Integrating Visualization with Temporal Data Mining Mihael Ankerst, David H. Jones, Anne Kao, Changzhou Wang 1 US patent pending.
3/13/2016 Data Mining 1 Lecture 2-1 Data Exploration: Understanding Data Phayung Meesad, Ph.D. King Mongkut’s University of Technology North Bangkok (KMUTNB)
1 Database Systems, 8 th Edition Star Schema Data modeling technique –Maps multidimensional decision support data into relational database Creates.
Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.
1 INTRODUCTION TO COMPUTER GRAPHICS. Computer Graphics The computer is an information processing machine. It is a tool for storing, manipulating and correlating.
Managing Data Resources File Organization and databases for business information systems.
Data Mining – Intro.
Operation Data Analysis Hints and Guidelines
Chapter 13 Business Intelligence and Data Warehouses
Chapter Ten Managing a Database.
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Chapter 2 Database Environment.
MANAGING DATA RESOURCES
Analysis models and design models
Database Systems Instructor Name: Lecture-3.
CHAPTER 7: Information Visualization
Presentation transcript:

1 Intro: What is datamining? Data are generated in large amount. E.g. transactions, telephone calls. Data is collected because believed to be a potential source of valuable info. Datamining is finding useful and interesting info from the data. Data can be "large" in two ways: width and height of dataset. At the beginning, we have the computer analyze the data and spit out result in text... Now we're moving towards "human-centred datamining," and visualization is one tool to do so.

2 Information Visualization and Visual Data Mining, Keim, IEEE Transactions on Visualization and Computer Graphics 8(1), DataJewel: Tightly Integrating Visualization with Temporal Data Mining, Mihael Ankerst, David H. Jones, Anne Kao, Changzhou Wang. ICDM Workshop on Visual Data Mining, Melbourne, FL, 2003 [Archived version] DEVise: Integrated Querying and Visual Exploration of Large Datasets, Miron Livny, Raghu Ramakrishnan, Kevin Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, and Kent Wenger. Proc. SIGMOD 1997.

3 Visual data mining: include the human in the data exploration process Combines 1) the flexibility, creativity and general knowledge of the human and 2)Enormous storage capacity and computational power of computers

4 Classification of Visual Data Mining Techniques 1) Data type to be visualized (6) 2) Visualization technique (5) 3) Interaction and distortion technique (5) These 3 dimensions of classification can be assumed orthogonal

5 1. Data type to be visualized (1/2) 1.1) 1-D data, usually the dimension is very dense. E.g. temporal data, like time series of stock prices. 1.2) 2-D data. E.g. geographical maps 1.3) Multi-Dimension E.g. tables from relational databases No simple mapping of attributes to the two dimensions of the screen

6 1. Data type to be visualized (2/2) 1.4) Text and hypertext, e.g. news articles Most of the standard visualization techniques cannot be applied. In most cases, a transformation of the data into description vectors is necessary first. E.g. word counting, then principal component analysis. 1.5) Hierarchies and graphs E.g. telephone calls 1.6) Algorithms and software E.g. for debugging operations

7 2. Visualization technique 2.1) standard 2D/3D displays e.g. bar charts and x-y plots. 2.2) geometrically transformed displays e.g. parallel coordinates. 2.3) icon-based displays (glyphs) 2.4) dense pixel displays

8 2.5) stacked displays Tailored to present data partitioned in a hierarchical fashion. Embed one coordinate system inside another coordinate system. Figure: by M. Ward, Worchestor Polytechnic

9 3. Interaction and distortion technique (1/2) Dynamic: changes to visualizations are made automatically Interactive: changes are made manually 3.1) Dynamic projections e.g. To show all interesting two-dimensional projections of a multi-dimensional dataset as a series of scatter plots. 3.2) Interactive filtering browsing: direct selection of desired subset querying: specify properties of desired subsets

10 3. Interaction and distortion technique (2/2) 3.3) Interactive zooming On higher zoom levels, more details are shown. 3.4) Interactive distortion Show portions of the data with high level of detail while other s are shown with lower. E.g. spherical distortion and fisheye views. 3.5) Interactive Linking and Brushing –Combine different visualization methods to overcome the shortcomings of single techniques. –Changes to one visualization are automatically reflected in the other visualization.

11 Critiques +Good summary of visual datamining and InfoVis in general. +Nice all-around introductory material. Concise. +Great references. Supported his classifications with ample examples, and cites figures from other papers. "see Fig. 5 in [10]" +Good amount of pictures

12 Information Visualization and Visual Data Mining, Daniel A. Keim, IEEE Transactions on Visualization and Computer Graphics 8(1), DataJewel: Tightly Integrating Visualization with Temporal Data Mining Mihael Ankerst, David H. Jones, Anne Kao, Changzhou Wang. ICDM Workshop on Visual Data Mining, Melbourne, FL, 2003 [Archived version] DEVise: Integrated Querying and Visual Exploration of Large Datasets Miron Livny, Raghu Ramakrishnan, Kevin Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, and Kent Wenger. Proc. SIGMOD 1997.

13 DataJewel Main contribution: The DataJewel architecture tightly integrates a visualization component, an algorithmic component and a database component for temporal data mining. Bridge the field of InfoVis with other research communities e.g. datamining. 2 aspects of temporal data mining: Need to add new mining algorithms easily; need to link tables together that have no primary key.

14 User-centric Data Mining (1/3) The mining process is recursive At least one attribute contains a timestamp for each record. Call it "event date". All attributes are "event attributes" Attribute values are "events"

15 User-centric Data Mining (2/3) Assumptions: a) number of event attributes is low. (<10) Often, in one given analysis, the analyst selects a small number of event attributes which can be associated with each other in a particular domain. b) number of different events of one event attribute is moderate. (<200) If this is not true, a concept of hierarchy can be defined for the event attribute. c) smallest time unit of interest in the event dates is one day

16 User-centric Data Mining (3/3) Using the above assumptions, one instance of the visualization and the algorithmic component are presented, and new ones can be easily integrated.

17 Visualization component: CalendarView Multi-Dimensional, with Even Date as the "key" Web-mining example:

18 [A dense pixel display and a stacked display and Linking and Brushing]

19 Interaction with CalendarView Selection: selected subset can be visualized following the iterative process Descending/Ascending order: good for finding "main" events and outlier events. [Interactive filtering and interactive zooming]

20 Temporal Mining Component These algorithms assign colour to events to allow users to observe patterns easily in the CalendarView. LongestStreak: Discover one event of one event attribute with the longest consecutive streak of significant days. (What about the longest N streaks?) MatchingEvents extends LongestStreak: Return the LongestStreak event and the correlated event. MatchingEvents2: returns the LongestStreak of the first event attribute and for each other event attribute, the event that is correlated.

21 Database Component (1/3) This component provide access to datasets in tables from relational database(s). The critical task is to scale up to large databases. Compute an aggregated version of the dataset such that it fits in main memory. Query:

22 Generate "Sufficient statistics" for event attribute page_hits Before After Database Component (2/3)

23 Database Component (3/3) mem_init = c * number of days * average number of events per day (= 402 in aircraft maintenance domain for one airline) mem_new = c * number of days * average number of distinct events per day (= 32) Summary statistics always fit in main memory and the computation of the proposed algorithm is efficient. Authors believe it is true for most datasets which fulfill their assumptions. E.g. number of event attributes is low (<10).

24 Experiment with airplane maintenance datasets (1/2) Pentium III/800Mhz and 1 GB main memory Datasets span years, with sufficient statistics fit in main memory 1) LongestStreak finds a system of an airplane: "engine fuel". During the last five days of July 2000, we perceive many events, indicating problems with engine fuel.

25 Experiment with airplane maintenance datasets (2/2) 2) Add several datasets to compare this finding. Manually colour every system except engine fuel with one light colour and a dark colour to all engine fuel related events: Pattern is not present. 3) Run MatchingEvents2 to single out one airplane, which has a lot of maintenance events ion Dec 3rd, ) Finally, select a dataset with maintenance events of just this plane. MatchingEvents algorithm finds fuel and communications events frequently co- occur. E.g. on Monday 18th, Nov. 5) Drill down to the raw data to further investigate.

26 Concluding remark Author believes the DataJewel architecture is also well adapted to areas like homeland security, market basket analysis, or intrusion detection.

27 Critique +Good example domains with which the DataJewel system is useful +Step-by-step procedure of a datamining session on airline maintenance example -How really useful is an architecture? To use DataJewel on other domains, still need to provide algorithm, visualization (and of course dataset). -Somewhat strong assumptions +The proposed algorithms can finish within 1 second -- this is over 10 years of airline maintenance data. Not bad. -But the run time for the system as a whole -- making the sufficient statistics table and rendering is not discussed.

28 Information Visualization and Visual Data Mining, Daniel A. Keim, IEEE Transactions on Visualization and Computer Graphics 8(1), DataJewel: Tightly Integrating Visualization with Temporal Data Mining, Mihael Ankerst, David H. Jones, Anne Kao, Changzhou Wang. ICDM Workshop on Visual Data Mining, Melbourne, FL, 2003 [Archived version] DEVise: Integrated Querying and Visual Exploration of Large Datasets, Miron Livny, Raghu Ramakrishnan, Kevin Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, and Kent Wenger. Proc. SIGMOD 1997.

29 DEVise DEVise is a data exploration system that allows users to easily develop, browse, and share visual presentations of large tabular datasets from several sources. [Multi-dimensional datasets] The framework has been already successfully applied to a variety of real applications.

30 Main contributions (1/2) 1) Visual Presentation Capabilities remarkable variety to be developed easily through a point-and-click or easy-to-write 'plugins' 2) Ability to handle large (bigger than main memory), distributed (e.g. over the Web) dataset by using a declarative approach to define their visualization primitives, instead of a programming-oriented style. 3) Collaborative data analysis: several users can share visual presentations of the data and dynamically explore these presentations.

31 Main contributions (2/2) Visual querying from a variety of local and remote sources. From the visual representations being used, the system can dynamically gather hints for what to index, materialize, cache or re-compute.

32 Examples Financial data exploration in the UW Business school: look for correlations and trends using the combined information from a variety of vendors. R-tree validation: discover subtle bugs in the R- tree bulk loading algorithms. Family Medicine and NCDC Weather Data: used by the UW Family Medicine department to provide physicians access to data that is collected and maintained independently by several clinics and also weather data from National Climate Data Center. Soil Sciences Classification: the BOREAS field experiment.

33 Visualization Model (1/2) It is based on mapping each source data record to a visual symbol on screen. "Plotting the data record" on some sort of graph. [standard 2D/3D displays] Source data called TData (tabular data) GData (graphical data) is the visualization with attributes x, y, size, color, etc. Mapping: a function that produces a GData record from a TData record. This is data-independent. Only depend on the TData schema (table column headings, variable types of the columbs)

34 Visualization Model (2/2) View: the basic display unit in DEVise, consists of 3 layers: background, data display, and cursor display. Background and cursor display are data- independent. Each view has a mapping, TData, and a visual filter. A visual filter is a set of selections on the GData attributes. E.g. a range of x and y. A visual filter is ultimately translated to a query VGData: visible GData. This is computed from TData and is the data-dependent portion of a view. View template: the data-independent portion

35 Coordination views (1/2) [Interactive linking and brushing] 2 mechanisms: Cursors and links A cursor allows the visual filter of one view (source view) to be seen as a high light in another view destination view). This is bi-directional. Visual link: visual filters of two views have share attributes. E.g. visual link on the x axis. Record link (positive or negative): a set of common TData attributes. The projection of the VGData on the linked attributes of the first linked view (the master) acts as a filter on the TData of the second linked view (the slave).

36 Visual link on X axis Record link on DID from V6 to V1

37 Coordination views (2/2) Operator link: an operator (such as union, intersection) is applied to VGData(s) of link masters and creates a TData for the link slave. Aggregate link: the second view visualizes some aggregate function, e.g., sum and average.

38 Another Matrix reference! “Operator Link” – Matrix Reloaded

39 Organizing complex visual presentations A windows: collection of views together with the set of cursors and links A visual presentation: a collection of windows plus a collection of links and cursors. A visual template: the data-independent portion of a visual presentation.

40 Visual Queries (1/2) 1) op1: changing the x-y ranges. 2) op2: click and display the actual TData record 3) op3: Move a cursor A query (called a linked query) maybe be generated as a side-effect of a visual query.

41 Effect of op1 in the presence of Visual Link on the X axis

42 Visual Queries (2/2) Links and cursors and visual queries can be defined in terms of relationship operators (selection, projection and function composition) on TData

43 Example: Visual links on attribute L

44 Visual Queries and SQL (1/3) Allows users who are not database experts to generate sophisticated SQL queries through intuitive graphical operations. Let T be a set of TData records (latitude, longitude, orders, totalamount) View 1 has a mapping that gives a scatter plot of totalamount vs. latitude. View 2 has a mapping that gives a scatter plot of order vs. latitude. The equivalent SQL queries are SELECT (totalamount, latitude) FROM T SELECT (order, latitude) FROM T

45 Visual Queries and SQL (2/3) A visual link on the x attribute: SELECT (totalamount, latitude, orders) FROM T A 'rubberband query' on View 1 which restricts the range of x and y < y < AND 30 < x < 40 on View 1 30 < x < 40 on View 2 Equivalent SQL queries: SELECT (totalamount, latitude) FROM T WHERE (10000 < TOTALAMOUNT < 20000) AND (30 < latitude < 40) SELECT (orders, latitude) FROM T WHERE (30 < latitude < 40)

46 Visual Queries and SQL (3/3) Vice versa, an SQL query can be expressed using a visual presentation. Queries can operate on both local and remote data sources. This is exploited by DEVise. Evaluate query at remote sites if supported Otherwise retrieve complete relations and do the rest locally.

47 Advanced Exploration Tasks (1/2) Integrated Access to Data and Metadata When datasets are very large and too much information is lost by compression, a powerful paradigm is to let users create summaries of data and to browse the summaries. E.g. statistical measures over subsets of the data. Support is built directly into the current version of DEVise.

48 Advanced Exploration Tasks (2/2) Collaborative Analysis A user can save a visual template (the data- independent part) and send it to another user. Such a visual template is called an "active report". Future work: Share a visual representation and changes made by one user are automatically seen by all users.

49 Critiques +Well developed and evolving system with a lot of real applications and many feedback from domain experts +I like visual querying of large database that doesn't fit in main memory and then displaying the result visually. -The simple x-y plot and bar graph are limiting. -A visual presentation with 6 windows and 10 views in total might be disorienting.

50