Promising “Newer” Technologies to Cope with the

Presentation transcript:

Promising “Newer” Technologies to Cope with the Information Flood Knowledge Discovery and Data Mining (KDD) Agent-based Technologies Ontologies and Knowledge Brokering Non-traditional data analysis techniques Model Generation As an Example To Explain / Discuss Technologies As I mentioned in the introduction, the goal of this talk is to introduce and describe newer technologies that, in my opinion, show some promise for coping with the information flood in health care. In this talk, I will focus on 3 particular technologies, namely, ... Moreover, during the course of the talk I will not only introduce the technologies but also analyze how they can cross-fertilize each other’s applications.

Knowledge Discovery in Data [and Data Mining] (KDD) Let us find something interesting! Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad) Frequently, the term data mining is used to refer to KDD. Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html) Field is more dominated by industry than by research institutions The first technology I like to … The above picture is, in my opinion, a good description of the task of knowledge discovery in that it illustrates a huge search space that contains very few interesting things. Applied in practice, KDD is frequently like finding a needle in a haystack, except that you are not sure what you are looking for...

Making Sense of Data --- Knowledge Discovery and Data Mining Introduction to KDD (1 class) Data Warehouses and OLAP (2 classes) Association Rule Mining (1.5 classes) Learning to Classify (1 class) Other Techniques: Clustering, Deviation Detection, Sequential Pattern Analysis (0.5 classes)

Data Mining: Confluence of Multiple Disciplines (figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines)

General KDD Steps (figure): Data sources → [Select/preprocess] → Selected/Preprocessed data → [Transform] → Transformed data → [Data mine] → Extracted information → [Interpret/Evaluate/Assimilate] → Knowledge. The select/preprocess and transform steps are grouped as data preparation.
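The steps on this slide can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names (age, income, bought) and the four function names are assumptions made up for the example, not part of the original talk.

```python
# Hypothetical toy records; the field names are illustrative only.
raw_records = [
    {"age": 25, "income": 30000, "bought": "yes"},
    {"age": None, "income": 52000, "bought": "no"},  # record with a missing value
    {"age": 47, "income": 61000, "bought": "no"},
    {"age": 31, "income": 45000, "bought": "yes"},
]

def select_preprocess(records):
    """Select/preprocess: drop records that have missing fields."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transform: discretize age into two coarse bands."""
    return [
        {"age_band": "under_40" if r["age"] < 40 else "40_plus",
         "bought": r["bought"]}
        for r in records
    ]

def data_mine(records):
    """Data mine: count purchase outcomes per age band."""
    counts = {}
    for r in records:
        key = (r["age_band"], r["bought"])
        counts[key] = counts.get(key, 0) + 1
    return counts

def interpret(counts):
    """Interpret/evaluate: present the counts in readable form."""
    return [f"{band}: bought={bought} in {n} record(s)"
            for (band, bought), n in sorted(counts.items())]

prepared = transform(select_preprocess(raw_records))
extracted = data_mine(prepared)
knowledge = interpret(extracted)
for line in knowledge:
    print(line)
```

In a real project each stage would be far more involved; the point of the sketch is only that the output of one step is the input of the next.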

Popular KDD-Tasks Classification (try to learn how to classify) Clustering (finding groups of similar objects) Estimation and Prediction (try to learn a function that predicts the value of a continuous output variable based on a set of input variables) Bayesian and Dependency Networks Deviation and Fraud Detection Text Mining Web Mining Visualization Transformation and Data Cleaning

KDD and Classical Data Analysis KDD is less focused than data analysis in that it looks for interesting patterns in data; classical data analysis centers on analyzing particular relationships in data. The notion of interestingness is a key concept in KDD. Classical data analysis centers more on generating and testing pre-structured hypotheses with respect to a given sample set. KDD is more centered on analyzing large volumes of data (many fields, many tuples, many tables, …). In a nutshell, the KDD process consists of preprocessing (generating a target data set), data mining (finding something interesting in the data set), and postprocessing (representing the found patterns in understandable form and evaluating their usefulness in a particular domain); classical data analysis is less concerned with the preprocessing step. KDD involves collaboration between multiple disciplines, namely statistics, AI, visualization, and databases. KDD employs non-traditional data analysis techniques (neural networks, association rules, decision trees, fuzzy logic, evolutionary computing, …).

Generating Models as an Example The goal of model generation (sometimes also called predictive data mining) is the creation, evaluation, and use of models to make predictions and to understand the relationships between the variables that are described in a data collection. Typical example applications include: generate a model that predicts a student’s academic performance based on the applicant’s data, such as past grades, test scores, past degree, …; generate a model that predicts (based on economic data) which stocks to sell, hold, and buy; generate a model that predicts whether a patient suffers from a particular disease based on the patient’s medical and other data. Model generation centers on deriving a function that can predict a variable using the values of other variables: v=f(a1,…,an) Neural networks, decision trees, naïve Bayesian classifiers and networks, regression analysis and many other statistical techniques, fuzzy logic and neuro-fuzzy systems, and association rules are the most popular model generation tools in the KDD area. All model generation tools and environments employ the basic train-evaluate-predict cycle. In the next section of this talk, I will describe how agent-based architectures could be used to support knowledge discovery and data mining, centering on the task of model generation. The goal of this section is to give you an idea of what it means to use intelligent agents for software development, and to illustrate how using agents is different from a more traditional software development approach.
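The train-evaluate-predict cycle mentioned on this slide can be illustrated with a minimal sketch. It assumes the simplest possible case, a single input variable a and a linear model v = f(a) fitted by ordinary least squares; the data values and the function names train, evaluate and predict are hypothetical and chosen only for illustration. Any of the techniques listed on the slide could take the place of the linear fit.

```python
def train(examples):
    """Train: fit v = slope * a + intercept by ordinary least squares."""
    n = len(examples)
    mean_a = sum(a for a, _ in examples) / n
    mean_v = sum(v for _, v in examples) / n
    s_av = sum((a - mean_a) * (v - mean_v) for a, v in examples)
    s_aa = sum((a - mean_a) ** 2 for a, _ in examples)
    slope = s_av / s_aa
    intercept = mean_v - slope * mean_a
    return slope, intercept

def predict(model, a):
    """Predict: apply the learnt function f to a new input value."""
    slope, intercept = model
    return slope * a + intercept

def evaluate(model, examples):
    """Evaluate: mean absolute error on examples not used for training."""
    return sum(abs(predict(model, a) - v) for a, v in examples) / len(examples)

# Hypothetical (input, output) pairs, split into training and test data.
training_set = [(1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)]
test_set = [(5, 11.5), (6, 12.5)]

model = train(training_set)
error = evaluate(model, test_set)
print("model:", model, "test error:", error)
print("prediction for a=10:", predict(model, 10))
```

Keeping the test examples out of training is what makes the evaluation step meaningful: the error is measured on data the model has not seen.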

Why Do We Need so many Data Mining / Analysis Techniques? No generally good technique exists. Different methods make different assumptions with respect to the data set to be analyzed (to be discussed on the next transparency) Cross fertilization between different methods is desirable and frequently helpful in obtaining a deeper understanding of the analyzed dataset.

Motivation: “Necessity is the Mother of Invention” Data explosion problem Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining Data warehousing and on-line analytical processing Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Why Data Mining? — Potential Applications Database analysis and decision support Market analysis and management target marketing, customer relation management, market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis Fraud detection and management Other Applications Text mining (news group, email, documents) and Web analysis. Intelligent query answering

Market Analysis and Management Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time Conversion of single to a joint bank account: marriage, etc. Cross-market analysis Associations/co-relations between product sales Prediction based on the association information

Fraud Detection and Management Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. Examples: auto insurance: detect a group of people who stage accidents to collect on insurance; money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network); medical insurance: detect professional patients and ring of doctors and ring of references

Other Applications Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat. Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining. Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Data Mining and Business Intelligence (pyramid figure; the potential to support business decisions increases toward the top). Layers from top to bottom: Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts (OLAP, MDA); Data Sources (Paper, Files, Information Providers, Database Systems, OLTP). Roles shown alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA.

Architecture of a Typical Data Mining System (figure; components from top to bottom): Graphical user interface; Pattern evaluation; Data mining engine; Knowledge-base; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases, Data Warehouse

An OLAM Architecture (figure; mining queries go in and mining results come out at the top). Layer 4, User Interface: User GUI API. Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine, Data Cube API. Layer 2, MDDB: MDDB, Meta Data, Database API, Filtering & Integration. Layer 1, Data Repository: Databases, Data Warehouse, Data cleaning, Data integration, Filtering.

Example: Decision Tree Approach

Decision Tree Approach2

Decision Trees Example: Conducted a survey to see which customers were interested in a new model car. Want to select customers for an advertising campaign. (training set)

One Possibility age<30 city=sf car=van likely unlikely likely

Another Possibility car=taurus city=sf age<45 likely unlikely
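A decision tree like the ones sketched on these two slides can be represented as nested tests. The slide text only preserves the test names and leaf labels, not the shape of the trees, so the arrangement below is an assumption made for illustration, and the customer records are hypothetical.

```python
# An internal node is (test_name, subtree_if_true, subtree_if_false);
# a leaf is just a label string. The shape of this tree is assumed.
tree = (
    "age<30",
    ("city=sf", "likely", "unlikely"),
    ("car=van", "likely", "unlikely"),
)

# Each test name maps to a predicate over a customer record.
TESTS = {
    "age<30": lambda c: c["age"] < 30,
    "city=sf": lambda c: c["city"] == "sf",
    "car=van": lambda c: c["car"] == "van",
}

def classify(node, customer):
    """Walk from the root to a leaf, following the outcome of each test."""
    if isinstance(node, str):  # reached a leaf label
        return node
    test_name, if_true, if_false = node
    branch = if_true if TESTS[test_name](customer) else if_false
    return classify(branch, customer)

# Hypothetical customers
young_sf = {"age": 25, "city": "sf", "car": "sedan"}
older_van = {"age": 40, "city": "la", "car": "van"}
print(classify(tree, young_sf))   # follows age<30, then city=sf
print(classify(tree, older_van))  # follows not age<30, then car=van
```

Different trees (such as the second possibility, which tests car=taurus first) can fit the same training set; choosing between them is the job of the tree-learning algorithm.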

Example: Nearest Neighbor Approach
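The slide's figure did not survive in the transcript, so here is a minimal sketch of a nearest neighbor classifier. It assumes Euclidean distance and a majority vote among the k closest training examples; the attributes (age, income in thousands), the labels and the function name knn_predict are hypothetical.

```python
import math
from collections import Counter

# Hypothetical training examples: ((age, income in thousands), label)
training_set = [
    ((25, 40), "likely"),
    ((30, 45), "likely"),
    ((28, 50), "likely"),
    ((55, 90), "unlikely"),
    ((60, 85), "unlikely"),
]

def knn_predict(query, examples, k=3):
    """Label a query point by majority vote among its k nearest examples."""
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((27, 42), training_set))
print(knn_predict((58, 88), training_set))
```

Because the method relies entirely on the distance function, attributes measured on different scales usually need to be normalized first; the sketch skips that step.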

Clustering income education age

Another Example: Text Each document is a vector e.g., <100110...> contains words 1,4,5,... Clusters contain “similar” documents Useful for understanding, searching documents sports international news business
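The document-as-vector idea can be made concrete with a short sketch. It assumes binary vectors over a fixed, numbered vocabulary (as in the <100110...> example) and uses cosine similarity as one possible notion of "similar"; the slide does not say which similarity measure was intended, and the documents below are hypothetical.

```python
import math

def to_vector(word_ids, vocab_size):
    """Binary vector with a 1 at position i if word i (1-based) occurs."""
    return [1 if i in word_ids else 0 for i in range(1, vocab_size + 1)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents over a 6-word vocabulary
doc_a = to_vector({1, 4, 5}, 6)  # the slide's example: words 1, 4, 5
doc_b = to_vector({1, 4, 6}, 6)
doc_c = to_vector({2, 3}, 6)

print(doc_a)
print(cosine_similarity(doc_a, doc_b))  # shares two of three words
print(cosine_similarity(doc_a, doc_c))  # shares no words
```

A clustering algorithm would then group documents whose pairwise similarity is high, for example into sports, international news and business clusters.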

Issues Given desired number of clusters? Finding “best” clusters Are clusters semantically meaningful? e.g., “yuppies’’ cluster? Using clusters for disk storage

Association Rule Mining Sales records (market-basket data) with columns: transaction id, customer id, products bought. Example Rules: age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]; buys(x, p2) ^ buys(x, p5) ⇒ buys(x, p8) [1%, 85%]
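The two numbers attached to each rule can be computed directly from market-basket data. The sketch below uses the usual definitions: the support of a rule A ⇒ B is the fraction of transactions containing both A and B, and its confidence is the fraction of transactions containing A that also contain B. The transactions are hypothetical and do not reproduce the percentages on the slide.

```python
# Hypothetical market-basket data: one set of products per transaction.
transactions = [
    {"p2", "p5", "p8"},
    {"p2", "p5"},
    {"p2", "p5", "p8"},
    {"p1", "p8"},
    {"p2", "p3"},
]

def count_containing(itemset, baskets):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(antecedent, consequent, baskets):
    """Fraction of all transactions containing antecedent and consequent."""
    return count_containing(antecedent | consequent, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among transactions with the antecedent, fraction with the consequent."""
    with_antecedent = count_containing(antecedent, baskets)
    if with_antecedent == 0:
        return 0.0
    return count_containing(antecedent | consequent, baskets) / with_antecedent

# Rule: buys(x, p2) ^ buys(x, p5) => buys(x, p8)
rule_support = support({"p2", "p5"}, {"p8"}, transactions)
rule_confidence = confidence({"p2", "p5"}, {"p8"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Association rule miners typically report only rules whose support and confidence exceed user-chosen minimum thresholds.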

Characteristics and Assumptions of Popular Data Mining/Analysis Techniques Distance based approaches (assume that a distance function with respect to the objects in the dataset exists) vs. order-based approaches (just use the ordering of values in their decision making; 3>2>1 is indistinguishable from 2.01>2>1.99) Approaches that make no assumptions / assume a particular distribution of the data in the underlying dataset. Differences in employed approximation techniques Rectangular vs. other approximation Linear vs. non-linear approximations Sensitivity to redundant attributes (variables) Sensitivity to irrelevant attributes Sensitivity to attributes of different degrees of importance Different Training Performance / Testing Performance What does the learnt function tell us about the analyzed data set? How difficult is it to understand the learnt function? Deterministic / non-deterministic approaches Stability of the obtained results
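The remark that 3>2>1 is indistinguishable from 2.01>2>1.99 for order-based approaches can be checked with a small sketch. It replaces each value by its rank, which is all an order-based method sees, and compares that with the spread of the values, which a distance-based method would use. The function names are assumptions for illustration, and ties are not handled specially.

```python
def ranks(values):
    """Replace each value by its rank (1 = smallest); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spread(values):
    """Distance between the largest and the smallest value."""
    return max(values) - min(values)

wide = [3, 2, 1]
narrow = [2.01, 2, 1.99]

# An order-based approach only sees the ranks, which are identical ...
print(ranks(wide), ranks(narrow))
# ... while a distance-based approach sees very different spreads.
print(spread(wide), spread(narrow))
```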

Summary KDD KDD: discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Mining can be performed in a variety of information repositories Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Multi-disciplinary activity Important Issues: KDD-methodologies and user-interactions, scalability, tool use and tool integration, preprocessing, interpretation of results, finding good parameter settings when running data mining tools,…

Where to Find References? Data mining and KDD (SIGKDD member CDROM): Conference proceedings: KDD, and others, such as PKDD, PAKDD, etc. Journal: Data Mining and Knowledge Discovery Database field (SIGMOD member CD ROM): Conference proceedings: ACM-SIGMOD, ACM-PODS, VLDB, ICDE, EDBT, DASFAA Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc. AI and Machine Learning: Conference proceedings: Machine learning, AAAI, IJCAI, etc. Journals: Machine Learning, Artificial Intelligence, etc. Statistics: Conference proceedings: Joint Stat. Meeting, etc. Journals: Annals of statistics, etc. Visualization: Conference proceedings: CHI, etc. Journals: IEEE Trans. visualization and computer graphics, etc.