Understanding Data Warehousing
Introduction
Data has always been an essential ingredient of decision-making and, in modern business, the need to obtain, store, and use data has increased dramatically as the complexity and scope of the global marketplace have expanded. Data warehousing is an environment established for the sole purpose of gathering, integrating, and delivering data from across multiple data sources for use in enterprise decision-making. However, its effectiveness can be extended to support any person, process, or system needing current and historical data that is consistent and relatable.
Defining Data Warehouse
A data warehouse is a computing environment composed of several technologies and products, including: Data Acquisition, Data Management, Data Modeling, Data Quality, Data Analysis, Metadata Management, Development Tools, Storage Management, Applications, and Administrative Functions. Some people have an incorrect impression of data warehouses: some think a warehouse only stores data; others think it can be adequately supported using only one product or piece of software. As you attempt to understand the nuances of data warehousing, it becomes apparent that it is more than it seems; many different factors can impact the effectiveness of a data warehouse solution. Copyright: The Art of Service 2008
Defining Data Warehouse (Part 2)
Data warehousing is about managing the data. The following data features are key reasons for having a data warehouse: Subject Orientation, Data Integration, Non-volatility, Time Variance, and Data Granularity. The purpose of data warehousing is to provide an environment for managing data important to the enterprise, and it promotes the following features in the data being managed.

Subject-orientation – Operational systems store data based on the application using the data. For instance, a human resource application may store payroll data on employees, while an education application may store skill assessments on employees. When data is collected from these applications into a data warehouse, the data is separated and integrated into real-world subjects regardless of application: a common subject for both of these applications is the employee. Subjects are different for each enterprise implementing a data warehouse and indicate how the enterprise perceives its data.

Data integration – Data can come from several sources (applications) of information, and some of these sources may be external to the organization. In order to integrate data into subjects from several sources, two concerns become apparent: finding a common key between sources, such as an employee number, and resolving any conflicts in naming convention, such as the variants of an employee's name (e.g. John Doe; John S. Doe; Doe, John). Data warehousing drives consistency in data, which allows multiple enterprise systems to work together.

Non-volatility – The data found in the data warehouse is for querying and analysis, not for daily operational use. As a result, the data warehouse is updated with operational data at scheduled and precise times, such as once a day or once a week. In fact, different applications may require different frequencies for updating the data warehouse. Because of this separation of operational and analytical use of data, the data in the data warehouse is considered non-volatile.
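The subject-orientation and integration ideas above can be sketched in a few lines of code. This is a minimal illustration, not part of the original toolkit: the source systems (payroll and education), field names, and the name-normalization rule are all invented for the example; real warehouses use dedicated ETL tooling for this.

```python
# Sketch of subject-oriented integration: records about the same real-world
# subject (the employee) arrive from two hypothetical source systems, joined
# on a shared employee number, with inconsistent name formats resolved.

def normalize_name(raw: str) -> str:
    """Reduce 'Doe, John' / 'John Doe' variants to a single 'John Doe' form."""
    if "," in raw:                      # 'Doe, John' -> 'John Doe'
        last, first = [p.strip() for p in raw.split(",", 1)]
        return f"{first} {last}"
    return " ".join(raw.split())

def integrate_by_subject(payroll_rows, education_rows):
    """Merge rows from both sources into one record per employee number."""
    employees = {}
    for row in payroll_rows:
        emp = employees.setdefault(row["emp_no"], {"emp_no": row["emp_no"]})
        emp["name"] = normalize_name(row["name"])
        emp["salary"] = row["salary"]
    for row in education_rows:
        emp = employees.setdefault(row["emp_no"], {"emp_no": row["emp_no"]})
        emp.setdefault("name", normalize_name(row["name"]))
        emp["skills"] = row["skills"]
    return employees

payroll = [{"emp_no": 101, "name": "Doe, John", "salary": 55000}]
education = [{"emp_no": 101, "name": "John Doe", "skills": ["SQL"]}]
merged = integrate_by_subject(payroll, education)
```

The common key (here `emp_no`) is what makes the merge possible; without it, the two name variants could not be recognized as the same employee.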
Time Variance – Because of the scenario just discussed, the data in the data warehouse is, in essence, historical data; however, by storing this data centrally, applications can query it and relate it to current activities. For example, a retail application that makes recommendations for customers can use historical data on a customer's purchases and relate it to the items the customer is currently interested in. As a result, data found in the data warehouse has embedded variation from current information based on time.

Data Granularity – Data kept in operational systems is typically stored and accessed at the lowest level of detail. While this level of detail may be stored in the data warehouse, users typically access the data in a summarized form; if more detail is needed, they can drill down to a deeper level of detail. From a business perspective, data summarization and granularity are important for putting information into context. For instance, consider hospital administration: you want to see the level of care given to patients over a month, so you view a summary of patients by department or illness. From this information, you may find that the number of people with gunshot wounds increased dramatically in the emergency room, so you focus on that information and discover that the majority of instances occurred in a specific neighborhood or during a specific time of day. This in turn allows you to adjust the personnel schedule to ensure that the emergency room is adequately staffed to cover any future incidents.
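The granularity discussion above can be illustrated with a small sketch: detail rows are summarized for a first look, then filtered for a drill-down. The hospital visit data and field names are invented for the example.

```python
# Sketch of data granularity: a summary view first, then a drill-down
# into one department. Detail rows remain available underneath.
from collections import Counter

visits = [
    {"dept": "ER", "cause": "gunshot", "hour": 23},
    {"dept": "ER", "cause": "gunshot", "hour": 22},
    {"dept": "ER", "cause": "fracture", "hour": 10},
    {"dept": "Cardiology", "cause": "chest pain", "hour": 9},
]

# Summary level: patient count per department.
by_dept = Counter(v["dept"] for v in visits)

# Drill-down: within the ER, break the count out by cause.
er_by_cause = Counter(v["cause"] for v in visits if v["dept"] == "ER")
```

The same detail rows support both views; granularity is a question of which aggregation the user asks for, not of storing separate data sets.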
Benefits of Data Warehousing
Data warehousing provides the following benefits: A comprehensive and integrated perspective of the enterprise; Availability of current and historical information for strategic decision making; Mitigation of operational risks related to supporting the decision-making process; A flexible and interactive source of information. Each of these benefits will be expanded upon throughout this presentation.
Introducing Business Intelligence
Business Intelligence is a set of disciplines designed specifically to establish a consistent decision-making environment. Business Intelligence does not replace data warehousing, but uses it extensively in its processes. Business Intelligence can be described as a two-step process: transforming data into information, then transforming information into knowledge. The first step of the business intelligence process focuses on data extraction, integration, cleaning, and storage. The process moves the data from the operational systems of the enterprise into the established data warehouse solution. The integration of data from several operational systems provides the enterprise with a greater level of quality in information about the enterprise than any single system can provide on its own. For instance, integrating employee data from systems such as payroll, education, resource management, and knowledge management can provide more information about the value of an employee's current and prospective contribution to the organization. The second step focuses on analyzing the information found in the data warehouse to create knowledge about the enterprise to be used in making effective decisions. Customer data, for instance, is often analyzed to determine what products customers are looking for, or which demographics should be marketed to when releasing a new campaign for a product.
Functional Components of a Data Warehouse
The data warehouse environment supports three high-level functions: Data Acquisition, Data Storage, and Information Delivery.
Physical Components of a Data Warehouse
These functional components can be broken down into physical entities that create the data warehouse environment: Data Sources, Data Staging, Data Storage, Metadata, Information Delivery, and Control and Management. The Data Acquisition function relies on locating data relevant to the business, which can include operational data from production systems, internal data, archived data, or data from external sources; a data warehouse solution will typically draw on multiple data sources. The data staging area is a holding area composed of tools to extract, transform, and load the data from the data sources to the data warehouse. As part of this staging, data from different sources is integrated and organized based on relevant subjects. The efforts of data staging result in data that is sufficiently organized, and in the creation of metadata for locating and understanding the data. Metadata is simply data about the data being stored in the data warehouse. It is a critical aspect of the data warehouse solution and is often the first thing that most people will see regarding the stored data. When a person asks a question about the enterprise, such as "how many people purchased XYZ product?", the answer is typically assembled with the help of existing metadata. As the person drills down into a particular subject, more raw data appears. The raw data and the metadata are typically stored independently of each other; in fact, planning of data storage will need to consider the storage requirements for data staging, metadata, and raw data. Information Delivery consists of the tools and interfaces that allow the data to be used by the enterprise, including query tools, analysis tools, and reporting. Control and Management covers the administrative and governance functions of the data warehouse.
Source Data
Data sources can include:
Operational Data, Internal Data, Archived Data, and External Data. Data can be structured or unstructured, and in prepared or raw formats. Data is everywhere. For a business, the majority of relevant data is operational, originating with the production systems and applications used to support the business. Most operational systems will organize and manage current data on a limited basis; unfortunately, these systems only interpret the data based on its function or purpose in the business. Extracting this data to a data warehouse allows it to be integrated and related to data from other systems. Additionally, queries and reports run directly against operational systems are very narrow, and an extensive query takes computing power away from production. So, by housing the data within the data warehouse rather than the operational system, the business minimizes the impact of decision-making on production. Internal data captures any information gathered outside of the operational systems, often in the form of documents, spreadsheets, and databases. Policies and procedures can be considered internal data, as can profiles for employees, customers, and suppliers. Often, internal data is "private" because its information is gathered by a single person or small group, but its value is expected to be greater when integrated with the larger enterprise data set. Archived data is any old operational or internal data available to the organization. Typically, archived data is stored outside of the operational system, which is interested only in current data. Many instances of archived data may exist even when the originating operational system no longer exists. Archived data may be found in a variety of formats, including disks, tapes, microfilm, paper, and the like, and may be stored offsite from the primary business location. External data is simply any data generated outside of the organization.
This type of data could be provided by national or international agencies, regulatory bodies, consultants, and marketing firms; it can be free or purchased. Because data can come from any number of sources, the original state of the data is a concern for data extraction and staging. Data can be structured: that is, in a format easily interpreted by most tools because the data consists of short, well-defined values. Social security numbers, employee numbers, names, and addresses are all considered structured data. About 25% of all available data is considered structured. Unstructured data requires special tools to interpret what exists, often requiring the data to be transformed into structured data. Videos, audio recordings, images, and complex text strings are all considered unstructured. The data extracted to the data warehouse may be found in a raw format or a prepared format. If the data is in raw format, it will need to be prepared during data staging. Prepared data, in theory, does not require any additional preparation, as long as the preparation is consistent with the rules, controls, and standards of the data warehouse; if not, the data may need to be "re-prepared" for the data warehouse.
Data Staging
The activities of data staging are:
Extracting data from the data sources; Transforming the data into usable information; Loading the data and metadata into data storage. ETL (data extraction, transformation, and loading) is considered the most time-consuming and human-intensive activity in data warehousing. Most of the work of data warehousing is performed during data staging, which represents the first step of the business intelligence process. At this point, we know where the data is coming from (the data sources). What must be decided is how the data is extracted, when, and how often. The answers to these questions depend on the individual data sources. The computing platforms of individual operational systems can influence how data is extracted, as can the data structures of the extracted data. Different tools and methods may be used to extract the data, each with its own nuances and challenges. Unless your organization has already tightly controlled how data is generated in its operational systems, expect that more than one tool or procedure may be required to perform data extraction effectively. This is definitely true if your organization expects to leverage data from external sources. Several activities may be required to transform data, including: Selecting inputs; Separating input structures; Applying standards (normalization); De-normalization; Aggregating data; Converting data; Resolving corrupted or missing data; Managing duplicate data; Applying naming conventions. Proper data management requires that each data element be clearly defined and its relationships identified. When data is coming from several data sources, some data elements will be the same and will surface even more relationships that need to be addressed. Therefore, whenever a new data source is added to the environment, more work is required to ensure the data can be added to the warehouse properly. Data loading can be complicated if data is extracted from different systems at different times and frequencies.
Data loading may be scheduled on a daily, weekly, or monthly basis. The growing interest in real-time data warehousing requires even greater attention to planning data updates to the warehouse, and even back to operational systems.
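A few of the transformation activities listed above can be sketched in miniature: applying a naming convention to source fields, resolving missing data, and managing duplicates. The field mappings, default values, and business key are illustrative assumptions, not a real schema.

```python
# Sketch of the ETL transformation step: rename source fields to the
# warehouse naming convention, fill missing values with an explicit
# placeholder, and drop duplicate rows based on a business key.

FIELD_MAP = {"EMPNO": "emp_no", "NM": "name", "DEPT": "department"}

def transform(raw_rows):
    seen, clean = set(), []
    for raw in raw_rows:
        # Apply the warehouse naming convention to source field names.
        row = {FIELD_MAP.get(k, k.lower()): v for k, v in raw.items()}
        # Resolve missing data with an explicit placeholder value.
        row.setdefault("department", "UNKNOWN")
        # Manage duplicates using the business key.
        if row["emp_no"] in seen:
            continue
        seen.add(row["emp_no"])
        clean.append(row)
    return clean

rows = transform([
    {"EMPNO": 1, "NM": "Ada", "DEPT": "IT"},
    {"EMPNO": 1, "NM": "Ada", "DEPT": "IT"},   # duplicate row
    {"EMPNO": 2, "NM": "Grace"},               # missing department
])
```

Real ETL tools perform these same operations at scale, driven by the metadata and rules defined during data staging design.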
Data Quality
One purpose of data staging is to raise the quality of the data used in decision making: bad data will lead to bad decisions. Data quality is influenced by: Inadequate database designs; Aging of data; Dummy or absent data; Non-unique identifiers; Ineffective primary keys; Violation of business rules; Lack of policies and procedures; Input errors. Data quality is an important concern for any data management activity and is at the heart of designing effective ETL activities in data warehousing.
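A few of the quality issues listed above can be checked mechanically. The sketch below is illustrative only: the field names and the business rule (no negative salaries) are invented assumptions, not rules from the toolkit.

```python
# Sketch of simple data-quality checks: non-unique identifiers,
# dummy or absent data, and a business-rule violation.
from collections import Counter

def quality_issues(rows, key="emp_no"):
    issues = []
    counts = Counter(r[key] for r in rows)
    for k, n in counts.items():
        if n > 1:
            issues.append(f"non-unique identifier: {key}={k}")
    for r in rows:
        if r.get("name") in (None, "", "N/A"):     # dummy or absent data
            issues.append(f"missing name for {key}={r[key]}")
        if r.get("salary", 0) < 0:                 # business-rule violation
            issues.append(f"negative salary for {key}={r[key]}")
    return issues

probs = quality_issues([
    {"emp_no": 1, "name": "Ada", "salary": 50000},
    {"emp_no": 1, "name": "N/A", "salary": -10},
])
```

In practice such checks run inside the ETL process, so bad records are flagged or quarantined before they ever reach the warehouse.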
Data Storage
Organizations must establish the storage requirements for: Data staging; The corporate data warehouse; Individual data marts; OLAP-based multidimensional databases. Not every database is the same, and each database system will have different storage requirements. Data warehouses are typically supported by a relational database management system (RDBMS), though if you are engaging a vendor to provide your data warehouse needs or even some business functions (ERP or CRM), they may be using a proprietary multidimensional database system (MDDB). The most well-known RDBMS solutions are Oracle, DB2, Microsoft SQL Server, SAP/Sybase, and Teradata. Choosing the right database management system (DBMS) will generally entail a clear understanding of: Users' experience with DBMSs; Types of queries required; Open accessibility to data; Data loads; Management of metadata; Locations for storing data; Potential growth of the warehouse.
Information Delivery
The requirements for information delivery reside in expectations related to: Query types and frequencies; Report types and frequencies; Types of analysis; Distribution of information; Real-time requirements; Applications for decision support; Potential growth and expansion.
Metadata
The core of the data warehouse is its metadata.
In operational systems, data is provided to users through predefined interfaces and reports, but with a data warehouse, access to the data cannot be predefined as clearly because users may not know what they are looking for. As a result, users create and run their own queries and reports. To do this, they must have a good understanding of what data is available. This is possible because the data is described and catalogued: the result is a compilation of data about data called metadata. As the data warehouse expands and grows, the physical design and loading of the warehouse become more complicated. Metadata also provides information about the staging area, the logical structure of the databases, and any other information relevant to the design of the data warehouse. Metadata provides the foundation for administering a data warehouse by providing answers to the following questions and more: How to make changes to data? How to add a new data source to the environment? How to clean the data? How to use new methods in data cleaning and transformation? How to add summaries? How to control runaway queries? How to expand storage? How to schedule backups? How to keep data definitions current? Metadata drives every process of the data warehouse.
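The idea of metadata as "data about the data" can be made concrete with a tiny catalog sketch. The table, columns, sources, and dates below are invented for illustration; real metadata repositories are far richer and usually vendor-managed.

```python
# Sketch of a metadata catalog: each entry describes what a warehouse
# table holds, where its data came from, and how it is refreshed.

catalog = {
    "sales_fact": {
        "description": "Daily sales transactions by product and store",
        "source_systems": ["POS", "e-commerce"],
        "columns": {"sale_date": "date", "product_id": "int",
                    "amount": "decimal"},
        "last_loaded": "2008-06-01",
        "refresh": "daily",
    },
}

def describe(table: str) -> str:
    """Answer a user's first question: what is in this table?"""
    meta = catalog[table]
    return (f"{table}: {meta['description']} "
            f"(sources: {', '.join(meta['source_systems'])}; "
            f"refreshed {meta['refresh']})")
```

A user browsing this catalog learns what exists and where it came from before ever touching the raw data, which is exactly the role the slide assigns to metadata.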
What is a Data Mart?
A data mart is a subset of a data warehouse. A data warehouse will typically contain data relevant to the entire enterprise, while a data mart contains data relevant to a line of business or department within the enterprise. Deployment of data warehouses and data marts will usually take one of the following approaches: Top-down (data warehouse first, data marts second) or Bottom-up (data marts first, data warehouse second). The inclusion of a data mart can be confusing to a person unfamiliar with data warehousing: this is because a data mart and a data warehouse are essentially the same thing, using the same technologies and products and serving the same purpose. The difference between the two is simply the scope of the data contained and managed. Where data marts and data warehouses become an important topic for discussion is in their deployment, namely, which should be built first. There are two general approaches: top-down and bottom-up. The first approach, top-down, focuses on establishing the overall data warehouse first, along with its strategies and standards, and then focuses on establishing small data marts for key areas of the enterprise based on some definable structure, usually organizational. This approach assumes that no organization within the enterprise has a data warehouse solution in place. The advantage of the top-down approach is that it is truly architected for the enterprise and is not a conglomeration of smaller data marts: the rules and controls of the data warehouse are centralized and consistent across the enterprise. Unfortunately, it takes longer to implement a data warehouse first, and the approach has a high potential for failure because it requires a longer commitment of acceptance and skills from organizational entities that may not benefit immediately. The bottom-up approach establishes a data mart for each organization within the enterprise.
This is the best approach for an enterprise whose internal organizations have already established data warehouses for themselves. Because the implementation focuses on the smaller data mart solutions, it tends to be faster and more manageable, with a favorable return on investment. The problems with this approach relate to the narrow perspectives on the data it encourages within each organization, and the increased possibility of redundant, inconsistent, and irreconcilable data.
Data Warehouse Architecture
There are five basic architectures in data warehousing: Centralized Data Warehouse – one data warehouse with no data marts. Independent Data Marts – several autonomous data marts with no central data warehouse. Federated Data Marts – several data marts operating under standardized controls with no central warehouse. Hub-and-Spoke – several data marts with a central data warehouse. Data-Mart Bus – several data marts created to conform to the standards and controls of the original data mart. The architectures in data warehousing center on the relationships of data warehouses and data marts, and how they are used. For instance, the use of a data warehouse assumes a centralized or hub-and-spoke architecture: in the first, no data marts exist and, in the second, the existing data marts take their cues from the data warehouse. In both architectures, the data warehouse defines how the business views and handles the data. In an independent data mart architecture, there is no central data warehouse and no governing policies for handling data at the enterprise level. Each department supported by a data mart has the responsibility to define how the department views and handles the data it needs: generally, this data is generated within the department, and any data that is not is considered external. If the larger organization would like to impose enterprise-wide policies and controls to be complied with by the individual data marts, the architecture moves closer to a federated data mart architecture. While there is no central data warehouse, the consistent controls across multiple data marts make sharing and integration of departmental data much easier to execute. The data mart bus architecture is a unique solution and is not recommended for organizations which may have an existing data warehouse solution in place.
The idea of the data mart bus is to design a data mart based on a specific business subject (for instance, companies that are customer-oriented may create a data mart on the customer); all other data marts are then designed based on the controls, dimensions, and policies of the initial data mart. Therefore, if a product data mart is designed, it will have a customer-focused perspective to its structure. Typically, adopting the data mart bus architecture requires "reinventing the wheel" for the entire organization and is not suited to organizations where some departments are heavily invested in a current data mart or data warehouse solution.
Why Data Warehousing?
What does a data warehouse provide the user?
Ability to run simple queries and reports against current and historical data; Ability to perform "what if" scenarios; Ability to iteratively query and analyze deeper into the data; Ability to identify historical trends and apply them effectively to future situations. A data warehouse solution is designed to support the user, so the basic question to ask is, "What does the user get?" A satisfied user will naturally lead to a more effective business, especially when the right decisions are being made consistently.
Challenges in Data Acquisition
The typical challenges facing data acquisition activities are: Large number of data sources; Disparate data sources; External data sources; Ongoing data feeds; Different computing platforms; Data replication; Data integration; Data cleansing; Complex data transformations. The challenges of data warehousing are best presented relative to its key functional components: data acquisition, data storage, and information delivery. With data acquisition, the first challenges relate to the numerous and potentially diverse data sources from which data will be extracted. The sheer number of potential data sources requires significant work to understand the data, its elements, its current structure, and how it can be integrated into the data warehouse solution. This effort becomes more complex as the configuration of the sources can be very diverse: computing platform, database structures, data types, interfacing, and a myriad of other concerns. In most cases, these configurations have some flexibility because the data sources are internal systems, so the organization can create a workable solution from both sides of the interface. However, when dealing with external sources, the organization may not have any influence in getting the data in a compatible format from the providing party. While most of this work will occur before the initial extraction of data, some planning and definition has to be done to ensure that ongoing data feeds are handled properly. This may be easy if every data feed occurs at the same time every day, and at the same frequency over time. But when integrating data from multiple data sources, it is likely that even the scheduling of data feeds from individual systems will be complex. And at this point, we haven't even touched the data.
As data is extracted into the data staging area, many abnormalities may be present: Data may be missing, incomplete, or inaccurate; Data may not conform to standard naming conventions; Data may be duplicated between several sources; Data from different sources may conflict; New data types may be found; Complex data types, such as images or recordings, may be corrupted. In some cases, several tools may be used simultaneously to extract, cleanse, integrate, and prepare the data to be stored in the data warehouse. Each tool must be configured and used as part of the data acquisition package, sometimes with specific rules and procedures for its proper use which must be followed carefully.
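One of the abnormalities above, conflicting values between sources, can be detected mechanically during staging. The sketch below is illustrative: the source names (an HR system and a CRM) and fields are invented assumptions.

```python
# Sketch of cross-source conflict detection during staging: the same
# business key carries different values for the same attribute.

def find_conflicts(source_a, source_b, key, field):
    """Return {key_value: (value_in_a, value_in_b)} where the sources disagree."""
    a = {r[key]: r[field] for r in source_a}
    b = {r[key]: r[field] for r in source_b}
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}

hr = [{"emp_no": 7, "dept": "Finance"}]
crm = [{"emp_no": 7, "dept": "Sales"}]
conflicts = find_conflicts(hr, crm, "emp_no", "dept")
```

A report like this gives the staging team a concrete list of records to reconcile before the data is loaded, rather than letting two versions of the truth reach the warehouse.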
Challenges in Data Storage
The typical challenges facing data storage activities are: Large data volumes; Large data sets; New data types; Data storage in the staging area; Multiple index types; Parallel processing; Data archiving; Tool compatibilities. If everything goes well in data acquisition, the data is ready to be stored and made available to users. However, the challenges do not disappear. As a business becomes more dependent on its data and learns to use data more effectively, the demand on the data increases. It has been said that 90% of the world's data has been created in the last two years: if this estimate is even partly true of any single organization, it means the demand on data storage is rising exponentially for every organization. The volume of data is increasing and, with the interest in Big Data, the data itself is increasing in size. While many organizations may have sufficient knowledge of managing structured data, the evolving techniques for handling unstructured data require organizations to reconsider how they handle new data types. As to the actual challenges of physical storage, a data warehouse solution must adequately plan and provide capacity for data staging, data archives, metadata, and raw data, as well as the storage requirements of various analysis techniques that use storage to support parallel processing. Storage management solutions and requirements are also factors for data storage: for instance, a company may utilize any number of storage architectures, such as RAID, to meet business requirements. So what happens to the data warehouse storage requirements when some of the data must be replicated between several locations in real time to ensure consistent and immediate access for the business across the world? Some of the metadata created provides information about where the data is stored physically; losing this metadata is tantamount to losing all the data.
How this metadata is created, stored, and recovered are concerns that must be addressed when developing the data warehouse solution.
Challenges in Information Delivery
The typical challenges facing information delivery activities are: Multiple user types; Multiple query types; Complex queries; OLAP; Multidimensional analysis; Web-enablement; Metadata management; Tools from multiple vendors. If successful in acquiring the data and storing it, the next set of concerns relates to how that data is made available to users. One concern is that not all data is appropriate for all users, even within a single department. Understanding the different types of users, their authorities and responsibilities, and what data they require to fulfill their duties is the first step in understanding how to support them with the data warehouse solution. The next step is understanding and providing multiple ways of accessing the data: from simple queries to multidimensional analysis, from active searches to passive generation of reports. As users become more accustomed to using the web, or demand the ability to access data from web-enabled devices such as tablets and smartphones, organizations have been looking at methods and tools to accommodate them. The path to web-enablement is often made easier through data warehouses: it is sometimes easier to get to the data generated from a legacy system than it is to access the legacy system from the web. Still, to ensure the data can be accessed through a web interface, some further transformation may be required of the data during the data acquisition phase, or some additional requirements for data storage may apply. One simple assumption for data warehouses is that users will access the metadata before they access the data itself. This places an even greater burden on generating and managing metadata effectively for the user. As users begin to search, mine, and analyze the data, new relationships may be created between the data: relationships which are documented as metadata and made available in the future.
As a result, more users identifying more relationships will generate more metadata, increasing the demand on metadata management even further. The interfaces users employ to access the data can come from multiple vendors with different expectations of how data is presented, similar in scope and complexity to extracting data from data sources. In fact, it is quite possible that data may be extracted from an operational system, processed through the data warehouse, and made available to the same operational system. If every system reads the data differently, some thought must be given to ensuring that the data can be read by all systems appropriately. Add to this situation the question of whether the accessing system will "pull" data from the data warehouse when it needs it, or whether the data warehouse is expected to "push" the data to the system.
Relevant Data Warehouse Standards
Relevant standards for data warehousing, specifically metadata, are provided through: The Meta Data Coalition; The Object Management Group; The OLAP Council, for the Multi-Dimensional Application Programmers Interface (MDAPI). There are no widely accepted standards for data warehousing as a whole, though many of its components have applicable standards: for instance, storage management has standards for different implementations and architectures. The importance of metadata in data warehousing is underscored by the number of standards relevant to metadata management. The two organizations that have led the charge to standardize metadata approaches are the Meta Data Coalition and the Object Management Group. One discipline tightly related to data warehousing and metadata management is online analytical processing (OLAP): the OLAP Council is responsible for standards in this area.
Basic Project Plan
The basic plan for a data warehouse project is:
Planning; Defining requirements; Design; Build; Deploy; Maintain. The basic plan for implementing a data warehouse will be expanded upon in the document Developing Data Warehouse Capabilities.
The Toolkit
The Toolkit is designed to address the enterprise's relationship with data holistically, not just data warehousing. As part of its scope, a second presentation is available to introduce Data Analytics and Data Mining, which relates to the second step of Business Intelligence. The goal of the Data Warehouse/Analytics Toolkit is to define the contributing factors, major components, and their relationships, while providing the basic tools to take action based on the organization's needs.
Moving Forward
The participant can take two directions in using the toolkit at this point. To continue with the data warehouse discussion, the next document of interest is Developing Data Warehouse Capabilities, which is intended to be a step-by-step guide to creating a Big Data foundation in your organization. To learn more about data-related activities within an enterprise, see the presentation Introduction to Data Analytics and Mining. Multiple templates have been created to support the process and aid organizations in their efforts to improve their Data Warehouse and Data Analytics capabilities.