ICS426 Introduction: DATA WAREHOUSING AND DATA MINING (July 13, 2015)

Presentation transcript:

1 DATA WAREHOUSING AND DATA MINING (ICS426: Introduction, July 13, 2015)

2 Course Overview
- Introduction
- Data Preprocessing
- DW and OLAP
- Data Mining

3 Motivation: Data flood
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- There is a tremendous increase in the amount of data recorded and stored on digital media.
- We are producing over two exabytes (1 exabyte = 10^18 bytes) of data per year.
- Storage capacity, for a fixed price, appears to be doubling approximately every 9 months.
- Data stored in the world’s databases doubles every 20 months.
- Other growth rate estimates are even higher.
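To make the doubling periods quoted on this slide concrete, here is a minimal arithmetic sketch. It assumes growth is a clean exponential with a fixed doubling period, which the slide only states approximately. The function name `growth_factor` and the five-year horizon are illustrative choices, not part of the lecture.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


# Five years = 60 months (an arbitrary horizon chosen for illustration).
data_growth = growth_factor(60, 20)     # stored data, doubling every 20 months
storage_growth = growth_factor(60, 9)   # storage per unit price, doubling every 9 months

print(f"stored data grows about {data_growth:.0f}x in five years")
print(f"storage per unit price grows about {storage_growth:.0f}x in five years")
```

Under these assumptions, affordable storage outpaces the stored data, which fits the slide's remark that other growth estimates are even higher.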

4 Data, Data everywhere - yet...
- I can’t find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can’t get the data I need
  - need an expert to get the data
- I can’t understand the data I found
  - available data poorly documented
- I can’t use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another

5 Motivation
- Very little data will ever be looked at by a human.
- We are drowning in data, but starving for knowledge!
- “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.”
- Knowledge discovery is needed to make sense and use of data.
- Solution: data warehousing and data mining
  - Data warehousing and On-Line Analytical Processing (OLAP)
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

6 Knowledge Discovery (KDD) Process

7 KDD Process: Several Key Steps
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation
- Data mining: summarization, classification, regression, association, clustering
- Pattern evaluation and knowledge presentation
- Use of discovered knowledge
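The steps listed on this slide can be sketched as a chain of small functions. This is only a toy illustration: the records, field names, and the threshold are hypothetical. The "mining" step is a plain per-group average (summarization), chosen for brevity rather than taken from the lecture.

```python
from statistics import mean

# Hypothetical toy records; field names are illustrative only.
RAW = [
    {"customer": "a", "region": "north", "spend": 120.0},
    {"customer": "b", "region": "north", "spend": None},  # missing value
    {"customer": "c", "region": "south", "spend": 80.0},
    {"customer": "d", "region": "south", "spend": 100.0},
    {"customer": "e", "region": "east", "spend": 300.0},
]


def select(records, regions):
    """Create the target data set: keep only the regions of interest."""
    return [r for r in records if r["region"] in regions]


def clean(records):
    """Data cleaning: drop records with a missing spend value."""
    return [r for r in records if r["spend"] is not None]


def transform(records):
    """Data reduction: keep only the attributes the mining step needs."""
    return [(r["region"], r["spend"]) for r in records]


def mine(pairs):
    """Data mining (summarization): average spend per region."""
    groups = {}
    for region, spend in pairs:
        groups.setdefault(region, []).append(spend)
    return {region: mean(values) for region, values in groups.items()}


def evaluate(patterns, threshold):
    """Pattern evaluation: keep only summaries at or above a threshold."""
    return {k: v for k, v in patterns.items() if v >= threshold}


def kdd(records):
    """Run the steps in order: select, clean, transform, mine, evaluate."""
    target = select(records, {"north", "south"})
    return evaluate(mine(transform(clean(target))), threshold=100.0)


result = kdd(RAW)
print(result)
```

In practice the steps are iterative rather than strictly linear, and the cleaning stage is usually far more involved than dropping incomplete rows.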

8 What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]

9 What are the users saying...
- Data should be integrated across the enterprise
- Summary data has a real value to the organization
- Historical data holds the key to understanding data over time
- What-if capabilities are required

10 What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996] Data → Information

11 Evolution
- 60’s: Batch reports
  - hard to find and analyze information
  - inflexible and expensive, reprogram for every new request
- 70’s: Terminal-based DSS and EIS (executive information systems)
  - still inflexible, not integrated with desktop tools
- 80’s: Desktop data access and analysis tools
  - query tools, spreadsheets, GUIs
  - easier to use, but only access operational databases
- 90’s: Data warehousing with integrated OLAP engines and tools
- 2000’s: Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

12 Very Large Data Bases
- Terabytes (10^12 bytes): Walmart, 24 terabytes
- Petabytes (10^15 bytes): Geographic Information Systems
- Exabytes (10^18 bytes): National Medical Records
- Zettabytes (10^21 bytes): Weather images
- Yottabytes (10^24 bytes): Intelligence Agency Videos
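The unit ladder on this slide is a sequence of powers of 1000. The sketch below converts between the units; the 10^24 prefix is written as "yotta", the standard SI name. The helper names `to_bytes` and `convert` are illustrative only.

```python
# Decimal (SI) byte units, as powers of ten.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    return amount * UNITS[unit]


def convert(amount, from_unit, to_unit):
    """Convert between two units from the table above."""
    return to_bytes(amount, from_unit) / UNITS[to_unit]


walmart_bytes = to_bytes(24, "terabyte")  # the 24-terabyte example from the slide
print(walmart_bytes)                      # 24 followed by twelve zeros
print(convert(24, "terabyte", "petabyte"))
```

Each step up the ladder is a factor of 1000, so the 24-terabyte example is a small fraction of a single petabyte.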

13 Data Warehousing -- It is a process
- A technique for assembling and managing data from various sources for the purpose of answering business questions, enabling decisions that were not previously possible
- A decision support database maintained separately from the organization’s operational database

14 Data Warehouse: A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse, 1996

15 Data Warehouse Architecture (labels from the diagram)
- Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems
- Loading: Extraction, Cleansing, Optimized Loader
- Core: Data Warehouse Engine, Metadata Repository
- Access: Analyze, Query
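The components named on this slide suggest an extract, cleanse, load flow into a warehouse engine, with a metadata repository alongside. The sketch below is one hedged reading of that diagram, not the lecture's own design. It uses Python's built-in `sqlite3` as a stand-in for the warehouse engine, and the source rows, table names, and cleansing rules are all hypothetical.

```python
import sqlite3

# Hypothetical source extracts, standing in for the relational, legacy,
# and ERP sources named on the slide.
SOURCES = {
    "relational": [("  Alice ", "2015-07-01", "100.50")],
    "legacy": [("BOB", "2015-07-02", "n/a")],  # unusable amount
    "erp": [("carol", "2015-07-03", "75")],
}


def extract(sources):
    """Extraction: pull rows from every source, tagging their origin."""
    for name, rows in sources.items():
        for row in rows:
            yield (name, *row)


def cleanse(rows):
    """Cleansing: normalise names, drop rows whose amount is not numeric."""
    for source, customer, day, amount in rows:
        try:
            value = float(amount)
        except ValueError:
            continue
        yield (source, customer.strip().lower(), day, value)


def load(conn, rows):
    """Loader: bulk-insert cleansed rows and record simple load metadata."""
    rows = list(rows)
    conn.execute(
        "CREATE TABLE sales (source TEXT, customer TEXT, day TEXT, amount REAL)"
    )
    conn.execute("CREATE TABLE metadata (table_name TEXT, row_count INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.execute("INSERT INTO metadata VALUES (?, ?)", ("sales", len(rows)))
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
load(conn, cleanse(extract(SOURCES)))

# Query / Analyze: a simple aggregate over the loaded data.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

A real loader would work incrementally and record richer metadata, such as source schemas, load times, and transformation rules.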

16 Data Warehouse for Decision Support & OLAP
- Using information technology to help the knowledge worker make faster and better decisions
  - Which of my customers are most likely to go to the competition?
  - What product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over the last 10 years?

17 Decision Support
- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad-hoc
- Used by managers and end-users to understand the business and make judgements

18 Data Mining works with Warehouse Data
- Data Warehousing provides the Enterprise with a memory
- Data Mining provides the Enterprise with intelligence

19 Why Data Mining
- Credit ratings/targeted marketing:
  - Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  - Identify likely responders to sales promotions
- Fraud detection:
  - Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
- Customer relationship management:
  - Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
- Data mining helps extract such information

20 Why DM: A producer wants to know...
- Which are our lowest/highest margin customers?
- Who are my customers and what products are they buying?
- Which customers are most likely to go to the competition?
- What impact will new products/services have on revenue and margins?
- What product promotions have the biggest impact on revenue?
- What is the most effective distribution channel?

21 What is Data Mining?
- Data mining: a misnomer?
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Many definitions:
  - Non-trivial extraction of implicit, previously unknown and potentially useful information from huge amounts of data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

22 Data Mining: Confluence of Multiple Disciplines
- 20 x 20 ~ 2^400 ≈ 10^120 patterns
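The pattern count on this slide can be checked with a few lines of arithmetic. The sketch assumes the "20 x 20" refers to a grid of 400 cells that are each on or off, giving 2^400 possible patterns. That reading is an assumption, since the slide's figure is not part of the transcript.

```python
import math

cells = 20 * 20        # assumed: a 20 x 20 grid of binary cells
patterns = 2 ** cells  # every cell is on or off, so 2^400 combinations

# Decimal order of magnitude: log10(2^400) = 400 * log10(2).
exponent = cells * math.log10(2)

print(f"2^{cells} has {len(str(patterns))} decimal digits")
print(f"2^{cells} is about 10^{exponent:.1f}")
```

The exponent comes out just above 120, so 10^120 is the right order of magnitude for the size of the search space.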

23 Some basic operations
- Predictive:
  - Regression
  - Classification
  - Collaborative Filtering
- Descriptive:
  - Clustering / similarity matching
  - Association rules and variants
  - Deviation detection
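As one concrete instance of the descriptive operations listed on this slide, the sketch below scores an association rule with the two standard measures, support and confidence. The slide itself does not define these measures, and the transactions are made-up toy data.

```python
# Hypothetical market-basket transactions (toy data for illustration).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Rule: bread -> milk
rule_support = support({"bread", "milk"}, TRANSACTIONS)
rule_confidence = confidence({"bread"}, {"milk"}, TRANSACTIONS)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here bread and milk appear together in two of the four baskets, and two of the three bread baskets also contain milk. Mining algorithms search for all rules above chosen support and confidence thresholds rather than scoring one hand-picked rule.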

24 Applications ...
- Banking: loan/credit card approval
  - predict good customers based on old customers
- Customer relationship management:
  - identify those who are likely to leave for a competitor
- Targeted marketing:
  - identify likely responders to promotions
- Fraud detection: telecommunications, financial transactions
  - from an online stream of events, identify fraudulent events
- Manufacturing and production:
  - automatically adjust knobs when process parameters change

25 ... Applications
- Medicine: disease outcome, effectiveness of treatments
  - analyze patient disease history: find relationships between diseases
- Molecular/Pharmaceutical: identify new drugs
- Scientific data analysis:
  - identify new galaxies by searching for sub clusters
- Web site/store design and promotion:
  - find affinity of visitors to pages and modify layout

26 The course (labels from the diagram)
- Components: DS, DP, DW, OLAP, DM (Association, Classification, Clustering)
- Numbers shown on the diagram: (2) (3) (4) (5) (6) (7)
- Legend: DS = Data source, DW = Data warehouse, DM = Data mining, DP = Data processing

27 END

