Penn State Student Chapter of the Association for Computing Machinery We welcome all interested students to our 4th general meeting of the Spring 2005.

Presentation transcript:

Penn State Student Chapter of the Association for Computing Machinery
We welcome all interested students to our 4th general meeting of the Spring 2005 semester!
When: Monday, April 11th, 2005 from 7-8 pm
Where: Cybertorium (213 IST)
Agenda:
• Brief overview of our ACM chapter
• New officer introductions
• Special topic presentation: No Pain, No Game, presented by IST Professor Brian K. Smith
• Co-op/Intern presentation: Working at IBM, presented by Rick Osowski
Free refreshments will be provided.

IST Data Warehousing, Data Mining, and Advanced Applications

Data Rich, but Information Poor
• Data is stored, not explored: by its volume and complexity it represents a burden, not a support
• Data overload results in uninformed decisions, contradictory information, higher overhead, wrong decisions, increased costs
• Data is not designed and is not structured for successful management decision making

Improving Decision Making
[Figure labels: Data; Information; Decisions; Data Warehouse]

Data Warehouse Concepts

What’s a Data Warehouse?
A data warehouse is a single, integrated source of decision support information formed by collecting data from multiple sources, internal to the organization as well as external, and transforming and summarising this information to enable improved decision making.
A data warehouse is designed for easy access by users to large amounts of information, and data access is typically supported by specialized analytical tools and applications.

Data Warehouse Characteristics
Key characteristics of a data warehouse:
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile

Subject Oriented
Example for an insurance company:
• Applications area: Commercial and Life Insurance Systems; Auto and Fire Policy Processing Systems; Accounting System; Claims Processing System; Billing System
• Data warehouse: Policy; Customer Data; Losses; Premium

Integrated
Data is stored once in a single integrated location (e.g. insurance company).
[Figure: customer data stored in several databases (Auto Policy Processing System; Fire Policy Processing System; FACTS, LIFE Commercial, Accounting Applications) and in the Data Warehouse Database under Subject = Customer]

Time-Variant
• Data is tagged with some element of time: creation date, as-of date, etc.
• Data is available on-line for long periods of time for trend analysis and forecasting, for example five or more years.
• Data is stored as a series of snapshots or views which record how it is collected across time.
[Figure labels: Data Warehouse Data; Time; Data; Key]
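The snapshot idea on this slide can be illustrated with a minimal Python sketch. The row layout, the function names load_snapshot and history, and the sample balances are all hypothetical and chosen only for illustration; the point is that every row carries an as-of date and that rows are appended rather than overwritten, so a trend can be read back later.

```python
from datetime import date

# Hypothetical snapshot store: each row is tagged with an "as of" date
# and rows are only ever appended, never overwritten.
snapshots = []

def load_snapshot(as_of, customer_id, balance):
    """Append one time-stamped snapshot row."""
    snapshots.append({"as_of": as_of, "customer_id": customer_id, "balance": balance})

def history(customer_id):
    """Return (as_of, balance) pairs for one customer, oldest first."""
    rows = [r for r in snapshots if r["customer_id"] == customer_id]
    rows.sort(key=lambda r: r["as_of"])
    return [(r["as_of"], r["balance"]) for r in rows]

# Illustrative data only; snapshots may arrive in any order.
load_snapshot(date(2004, 12, 31), "C1", 1200)
load_snapshot(date(2003, 12, 31), "C1", 900)
load_snapshot(date(2004, 12, 31), "C2", 300)

trend = history("C1")
print(trend)
```

Because nothing is updated in place, the earlier year-end balance for customer C1 is still available for trend analysis after the later one is loaded.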

Non-Volatile
Existing data in the warehouse is not overwritten or updated.
[Figure labels: Production Applications (Update, Insert, Delete); Production Databases; External Sources; Load; Read-Only Data Warehouse Database; Data Warehouse Environment]

Transaction System vs. Data Warehouse

Transaction-Based Reporting System
[Figure labels: on-line, real time update into disparate systems (Unix, VMS, MVS, Other); day-to-day operations; system experts; data manipulation; users]

Warehouse-Based Reporting System
• BENEFIT: Reduce data processing costs
• BENEFIT: Integrated, consistent data available for analysis
• BENEFIT: Improve Network Reporting processes and analytical capabilities
[Figure labels: source systems (Unix, VMS, MVS, Other); Interfaces; Data Staging, Transformation and Cleansing; Summarization; Data Warehouse; OLAP; Executive Reporting and On-Line Analysis Environment]

Transaction - Warehouse Process
• “Transaction Based Process”: on-line, real time update; day-to-day operations; detailed information to operational systems.
• Batch load: transform, summarize and refine.
• “Warehouse Based Process”: decision support for management use.

Transaction System vs. Data Warehouse
Transaction System:
• Supports day-to-day operational processes
• Contains raw, detailed data that has not been refined or cleansed
• Volatile: data changes from day to day, with frequent updates
• Technical issues drive the data structure and system design
• Disparate data structures, physical locations, query types, etc.
• Users rely on technical analysts for reporting needs
• Operational processes impacted by queries run off of system
Data Warehouse:
• Supports management analysis and decision-making processes
• Contains summarized, refined, and cleansed information
• Non-volatile: provides a data “snapshot”; adjustments are not permitted, or are limited
• Business analysis requirements drive the data structure and system design
• Integrated, consistent information on a single technology platform
• Users have direct, fast access via On-line Analytical Processing tools
• Minimal impact on operational processes

Data Warehouse Architecture

Data Warehouse Architecture
[Figure labels: Operational System; Conversion & Interface; Staging Area; ODS; Data Warehouse; Data Marts; OLAP Cubes; Ad-hoc Reporting; Canned Reports]

Data Warehouse Architecture: Conversion and Cleansing Activities
• Map source data to target
• Data scrubbing
• Derive new data
• Data extraction
• Transform / convert data
• Create / modify metadata
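As a rough illustration of these activities, the following Python sketch maps two differently shaped source rows onto one target layout, scrubs and converts the values, derives a new field, and records where each row came from. The column names, the mapping tables AUTO_MAP and FIRE_MAP, the transform function, and the sample rows are assumptions made for the example, not part of the slides.

```python
# Hypothetical rows extracted from two operational systems with different layouts.
auto_rows = [{"CUST_NM": " alice smith ", "PREM": "1200.50", "ST": "pa"}]
fire_rows = [{"name": "Bob Jones", "premium": "800", "state": "PA"}]

# Map source data to target: source column -> warehouse column.
AUTO_MAP = {"CUST_NM": "customer_name", "PREM": "premium", "ST": "state"}
FIRE_MAP = {"name": "customer_name", "premium": "premium", "state": "state"}

def transform(row, mapping, source_system):
    """Convert one source row into the common warehouse layout."""
    target = {mapping[column]: value for column, value in row.items()}
    # Data scrubbing: trim whitespace and normalise capitalisation.
    target["customer_name"] = target["customer_name"].strip().title()
    target["state"] = target["state"].strip().upper()
    # Transform / convert data: text premium becomes a number.
    target["premium"] = float(target["premium"])
    # Derive new data: a monthly figure computed from the yearly premium.
    target["monthly_premium"] = round(target["premium"] / 12, 2)
    # Metadata: remember which system the row was extracted from.
    target["source_system"] = source_system
    return target

warehouse_rows = (
    [transform(r, AUTO_MAP, "auto") for r in auto_rows]
    + [transform(r, FIRE_MAP, "fire") for r in fire_rows]
)
for row in warehouse_rows:
    print(row)
```

Keeping the mapping in a table separate from the scrubbing logic means a new source system only needs a new mapping, which is one plausible way to organise this step.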

Data Warehouse Architecture: Data Warehouse Components
• Ranges from detailed to summarized data
• Contains metadata
• Many views of the data
• Subject-oriented
• Time-variant
[Figure labels: Detailed Data; Summary Data; Metadata]

Requirements Gathering Process
Business Measure Definition:
• Standard definition and related business rules and formulas
• Source data element(s), including quality constraints
• Data granularity levels (e.g., county detail for state)
• Data retention (e.g., one month, one quarter, one year, multiple years)
• Priority of the information (for example, is the information necessary to derive other business measures?)
• Data load frequency (e.g., monthly, quarterly, etc.)

Star Join Schema
Fact table:
• Monthly_Sales_Summary_Table (month, prod_id, region_id, account_id, vend_id, net_sales, gross_sales); sample values: region_id SW, NE, SW; net_sales 30,000, 23,000, 32,000; gross_sales 50,000, 42,000, 49,000
Dimension tables:
• Region_Dimension_Table (region_id, region_doc): NE Northeast; NW Northwest; SE Southeast; SW Southwest
• Account_Dimension_Table (account_id, account_doc): ABC Electronics; Midway Electric; Victor Components; Washburn, Inc.; Zerox
• Product_Dimension_Table (prod_id, prod_desc, prod_grp_id, prod_grp_desc): prod_desc Power supply, Motherboard, Co-processor; prod_grp_desc Fewer devices, Circuit boards, Components
• Time_Dimension_Table (month, mo_in_fiscal_yr, month_name): January; February; March
• Vendor_Dimension_Table (vend_id, vendor_desc): PowerAge, Inc.; Advanced Micro Devices; Farad Incorporated
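A star join can be tried out with Python's built-in sqlite3 module. The sketch below keeps only two of the dimensions from the slide (region and time) and a reduced fact table. The table and column names follow the slide, but the numeric month keys and the pairing of the three fact rows with January to March are assumptions, since the slide does not show which month each sales row belongs to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Region_Dimension_Table (region_id TEXT PRIMARY KEY, region_doc TEXT);
CREATE TABLE Time_Dimension_Table (month INTEGER PRIMARY KEY, month_name TEXT);
-- Fact table: one row per month and region, keyed by the dimension ids.
-- REFERENCES clauses document the star; SQLite enforces them only when
-- PRAGMA foreign_keys is switched on.
CREATE TABLE Monthly_Sales_Summary_Table (
    month       INTEGER REFERENCES Time_Dimension_Table(month),
    region_id   TEXT    REFERENCES Region_Dimension_Table(region_id),
    net_sales   INTEGER,
    gross_sales INTEGER
);
INSERT INTO Region_Dimension_Table VALUES
    ('NE', 'Northeast'), ('NW', 'Northwest'), ('SE', 'Southeast'), ('SW', 'Southwest');
INSERT INTO Time_Dimension_Table VALUES (1, 'January'), (2, 'February'), (3, 'March');
-- Assumed pairing of the slide's sample values with months 1 to 3.
INSERT INTO Monthly_Sales_Summary_Table VALUES
    (1, 'SW', 30000, 50000), (2, 'NE', 23000, 42000), (3, 'SW', 32000, 49000);
""")

# Star join: the fact table is joined to a dimension and summarised by it.
sales_by_region = conn.execute("""
    SELECT r.region_doc, SUM(f.net_sales)
    FROM Monthly_Sales_Summary_Table AS f
    JOIN Region_Dimension_Table AS r ON r.region_id = f.region_id
    GROUP BY r.region_doc
    ORDER BY r.region_doc
""").fetchall()
print(sales_by_region)
```

The same pattern extends to the account, product, and vendor dimensions: each adds one key column to the fact table and one JOIN to the query.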

Multi-Dimensional Analysis

Application Solution Classes
• Executive Information Systems (EIS): They present information at the highest level of summarization using corporate business measures. They are designed for extreme ease of use and, in many cases, only a mouse is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance.
• Decision Support Systems (DSS): They ideally present information in graphical and tabular form, providing the user with the ability to drill down on selected information. Note the increased detail and data manipulation options presented.

Data Mining

Data Mining
• The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
• Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

Data Mining
• Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive.
• Patterns and relationships are identified by examining the underlying rules and features in the data.
• Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.
• It is a relatively new technology, but it is already used in a number of industries.

Examples of Applications of Data Mining
Retail / Marketing:
• Identifying buying patterns of customers
• Finding associations among customer demographic characteristics
• Predicting response to mailing campaigns
• Market basket analysis
Banking:
• Detecting patterns of fraudulent credit card use
• Identifying loyal customers
• Predicting customers likely to change their credit card affiliation
• Determining credit card spending by customer groups
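Market basket analysis, one of the retail applications listed above, can be sketched in a few lines of Python. The baskets are made-up data, and support and confidence are the usual association-rule measures rather than anything defined on the slides; the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

# Count how often each unordered pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    with_antecedent = sum(1 for basket in baskets if antecedent in basket)
    return pair_counts[tuple(sorted((antecedent, consequent)))] / with_antecedent

print(support("bread", "milk"))
print(confidence("bread", "milk"))
```

Pairs with high support and confidence are candidates for the kind of buying pattern a retailer might act on, for example by placing the items together.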

Examples of Applications of Data Mining
Insurance:
• Claims analysis
• Predicting which customers will buy new policies
Medicine:
• Characterizing patient behavior to predict surgery visits
• Identifying successful medical therapies for different illnesses

Data Mining Operations and Associated Techniques

Database Segmentation
• The aim is to partition a database into an unknown number of segments, or clusters, of similar records.
• Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.
• Less precise than other operations, and thus less sensitive to redundant and irrelevant features.
• Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance or by assigning a weighting factor to each variable.
• Applications of database segmentation include customer profiling, direct marketing, and cross selling.
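One common way to segment records is k-means clustering, sketched below in plain Python. This is only one possible technique and not necessarily the one the slides have in mind: k-means needs the number of segments up front, whereas the slide speaks of an unknown number. The customer records, the k_means function, and the weights are hypothetical; the third attribute is given weight 0 to show how an irrelevant feature can be ignored.

```python
def weighted_distance(a, b, weights):
    """Squared distance in which each variable is scaled by its weighting factor."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))

def k_means(records, k, weights, iterations=10):
    """Plain k-means: returns a segment index per record and the segment centres."""
    centroids = [list(r) for r in records[:k]]  # naive seeding with the first k records
    assignment = []
    for _ in range(iterations):
        # Assign every record to its nearest centre.
        assignment = [
            min(range(k), key=lambda c: weighted_distance(r, centroids[c], weights))
            for r in records
        ]
        # Move each centre to the mean of the records assigned to it.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical customers: (age, yearly spend in thousands, irrelevant noise value).
customers = [(25, 2, 9), (27, 3, 1), (24, 2, 5), (60, 20, 2), (62, 22, 8), (58, 19, 4)]
weights = (1.0, 1.0, 0.0)  # weight 0 ignores the noise attribute

segments, centers = k_means(customers, k=2, weights=weights)
print(segments)
```

With these weights the six customers fall into a younger, low-spend segment and an older, high-spend segment, which is the kind of profile that customer profiling or direct marketing could build on.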

Scatterplot

Visualization

Data Mining and Data Warehousing
• A major challenge in exploiting data mining is identifying suitable data to mine.
• Data mining requires a single, separate, clean, integrated, and self-consistent source of data.
• A data warehouse is well equipped for providing data for mining.
• Data quality and consistency are prerequisites for mining, to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data.

Data Mining and Data Warehousing
• It is advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources.
• Selecting the relevant subsets of records and fields for data mining requires the query capabilities of the data warehouse.
• The results of a data mining study are useful if there is some way to further investigate the uncovered patterns. Data warehouses provide the capability to go back to the data source.

Advanced Database Topics

A Little History
Prior to the 1980s: hierarchical and network databases.
• Hardware: dumb terminals using private networks
• Database: centralized and stored on the disk packs
• End user terminals: simply input/output devices; processing at the mainframe
• Data: text data
• Networks had to handle text data
• No access from outside to the organization's private network

New Needs
• Microcomputers enabled workstation processing power.
• Satellite and network technology provided very high speed, high traffic, and low cost long distance communications networks.
• The Internet in the late 1990s, and the corresponding phenomenal growth in electronic commerce (e-commerce), necessitated public access to data from people's homes.
• The volume of data that needed to be transmitted increased greatly.

New Needs
• The business environment changed during the last two decades.
• Information stored at different locations, on different hardware and operating systems, with different commercial DBMS products, and with different underlying data models had to be combined.
• The centralized database could no longer handle these new demands.

Distributed Database Scenario
There are many advantages to using a distributed database rather than a centralized database:
• Improved performance, because high traffic data are stored locally.
• More efficient data management, because the DBA workload is shared.
• Better network integrity, because the whole system does not stop if one computer goes down.
• Expansion of the database is facilitated when the organization grows, since new data does not have to be centralized. It can remain and be administered in the original location.
• Data for the whole organization can still be accessed from any location.

Distributed Database
• Data administration is improved (??)
• In a distributed database system even a simple task like creating a backup copy of the database can take a considerable amount of time.
• If the database is divided among several locations, the time and workload for this task can be shared.

Replication of Data
• System failure in one location should not stop processing in other locations.
• Replicate all or parts of the database in more than one location.
• Database replication improves performance and provides a fail-safe option, but it involves considerable complexity.
• Replication of frequently used data improves response time and reduces network traffic.
• If the data changes at one location, it must be changed at all locations.
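The trade-off described on this slide can be made concrete with a toy Python model. The ReplicatedTable class, the site names, and the stored values are hypothetical, and the sketch deliberately ignores the hard part of real replication, namely bringing a failed site back in step after it misses updates.

```python
class ReplicatedTable:
    """Toy model of a table fully replicated at several sites."""

    def __init__(self, sites):
        self.replicas = {site: {} for site in sites}
        self.down = set()  # sites that are currently unavailable

    def write(self, key, value):
        # A change at one location must be applied at all locations.
        # (Simplification: copies at failed sites are updated too, whereas a
        # real system would have to queue the change until the site recovers.)
        for copy in self.replicas.values():
            copy[key] = value

    def read(self, preferred_site, key):
        # Read locally when possible; otherwise fall back to any site that is up.
        candidates = [preferred_site] + [s for s in self.replicas if s != preferred_site]
        for site in candidates:
            if site not in self.down:
                return site, self.replicas[site][key]
        raise RuntimeError("no replica available")

table = ReplicatedTable(["Erie", "Altoona"])  # hypothetical site names
table.write("part-17", 40)

local_read = table.read("Erie", "part-17")
table.down.add("Erie")  # simulate a failure at one location
failover_read = table.read("Erie", "part-17")
print(local_read, failover_read)
```

Reads are cheap because they stay local, while every write touches every copy; that asymmetry is why replication pays off mainly for frequently read data.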

Distributed Systems in an Ideal World
• C. J. Date established rules for the ideal distributed DBMS.
• The rules are a goal that distributed systems strive toward, but have not yet reached.
According to Date's rules:
• Each site is responsible for its own portion of the distributed database, including security, backup, and recovery.
• Each site has equal capabilities and does not rely on any other site.
• The system should work regardless of the computer hardware, operating system, or network installed at any site.

Date's Rules of Distributed Databases:
1. Local site independence
2. Central site independence
3. Failure independence
4. Location transparency
5. Fragmentation transparency
6. Replication transparency
7. Distributed query processing
8. Distributed transaction processing
9. Hardware independence
10. Operating system independence
11. Network independence
12. Database independence

Complexities of Distributed Databases
There are also many complications involved in the management of distributed database systems. The distributed database must be carefully designed to ensure the following:
• Store data as close as possible to where it is used most often.
• Make the location of the data transparent to the end user.
• Make the system easy to expand.
• Optimize queries to improve response time in the distributed environment.

Database Design
The designer must analyze the organization's needs and business processes to determine the best way to distribute the database. There are several possibilities for storing the data in more than one location:
• Centralized master database
• Replication of all or part of the database in several locations
• Horizontal partitions
• Vertical partitions
• Mixture of the above

Fragmentation
• Horizontal fragmentation of the database means that rows of a table (or tables) may be stored in different locations, similar to the separation of the customer table in the retailing example above.
• Vertical fragmentation means that columns of a table (i.e., attributes or groups of attributes of an entity) are stored in different locations.
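The two kinds of fragmentation can be shown on a small in-memory table. In this Python sketch the customer rows, the column names, and the helper functions horizontal_fragments and vertical_fragment are hypothetical. Each vertical fragment keeps the key column so that the original rows can be rebuilt by joining on it.

```python
# Hypothetical customer table held as a list of rows.
customers = [
    {"cust_id": 1, "name": "Ann", "region": "East", "credit_limit": 500},
    {"cust_id": 2, "name": "Raj", "region": "West", "credit_limit": 900},
    {"cust_id": 3, "name": "Li", "region": "East", "credit_limit": 700},
]

def horizontal_fragments(rows, column):
    """Split a table by rows: each distinct value of `column` gets its own fragment."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments

def vertical_fragment(rows, key, columns):
    """Split a table by columns: keep the key plus the chosen attributes."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]

# Horizontal: East rows could live at one site, West rows at another.
by_region = horizontal_fragments(customers, "region")

# Vertical: contact attributes at one site, credit attributes at another.
contact_site = vertical_fragment(customers, "cust_id", ["name", "region"])
credit_site = vertical_fragment(customers, "cust_id", ["credit_limit"])

# Rebuild the original rows by joining the vertical fragments on the key.
credit_by_id = {row["cust_id"]: row for row in credit_site}
rebuilt = [{**row, **credit_by_id[row["cust_id"]]} for row in contact_site]
print(rebuilt == customers)
```

Horizontal fragments are recombined with a union of rows and vertical fragments with a join on the key, which is why the key must be repeated in every vertical fragment.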

Query Formulation
• Distributed databases require a considerable amount of network overhead.
• A poorly formulated query may cause unnecessary data retrieval from the database.
• Query optimization is ideally performed by the distributed database management system.

OODB
• In traditional relational databases, E-R modeling and normalization focus on identifying entities, their attributes, and the relationships between entities.
• This works well for most organizational data, especially business data.
• The advent of the microcomputer brought processing power to the desktop.
• Computer aided design (CAD) became the norm for engineering work, so it became necessary to store drawings.
• Powerful multimedia PCs with sound cards and color monitors enabled the manipulation of sound and video files.
• Many other applications were developed that required more than just text and numeric processing.

Why??
• These new applications were facilitated by the development of object-oriented programming.
• Object-oriented data modeling, object-oriented databases, and object-oriented database management systems are still evolving.
• OODBMS and O/R DBMS are two types of database management systems that are currently available.
• O/R DBMS uses the basic theory of relational database management systems with object-oriented features added.
• OODBMS is more object-oriented and was developed separately from the relational products.
• OODBMS lacks the standardization that is available with relational database systems.