ISQS 3358, Business Intelligence
Supplemental Notes on the Term Project
Zhangxi Lin
Texas Tech University
Projects
Students will build a Hadoop system and explore and visualize a Hadoop-based data warehouse. Students are divided into three cohorts: Cohort 1 uses Pentaho for data analysis, Cohort 2 uses Tableau for data analysis, and Cohort 3 works on a self-selected business intelligence topic. Each cohort may have 2-4 teams with no more than 12 students in total, and each team is composed of 2-4 members. Deliverables include a 15-minute team presentation and a term report of 6-10 pages.
Project contents
Each team will identify a big data topic and find the data it needs. The dataset does not have to be truly large, but it must be representative. A data warehouse built on either SQL Server or Hadoop is fine. Data analysis and visualization must be well done. The report and presentation will cover the following points:
- Business background
- Data description
- Data model design
- ETL
- Analytical results
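As a rough illustration of how the ETL and analytical-results points fit together, the following is a minimal sketch only. It uses Python's built-in sqlite3 as a stand-in for SQL Server or Hadoop, and the sample data, the table name `sales`, and the function names are all hypothetical, not part of the project requirements.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; a real project would read a file or a service.
RAW_CSV = """order_id,region,amount
1,West,100.50
2,east,200.00
3,West,
4,East,49.50
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and normalize region names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["region"].strip().title(), float(row["amount"]))
        )
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the (hypothetical) sales table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Analytical result: total sales amount per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

In a term report, each of the three functions would correspond to a paragraph in the ETL section, and the final query to one analytical result.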
Topics
1. Data warehousing
   Focus: Hadoop data warehouse design
   Components: HDFS, HBase, Hive, NoSQL/NewSQL, Solr
2. Publicly available big data services
   Focus: Tools and free resources
   Components: Hortonworks, Cloudera, HaaS, EC2
3. MapReduce and data mining
   Focus: Efficiency of distributed data/text mining
   Components: Mahout, H2O, MLlib, R, Python
4. Big data ETL
   Focus: Heterogeneous data processing across platforms
   Components: Kettle, Flume, Sqoop, Impala
5. System management
   Focus: Load balancing and system efficiency
   Components: Oozie, ZooKeeper, Ambari, Loom, Ganglia, Mesos
6. Application development platform
   Focus: Algorithms and innovative development environments
   Components: Tomcat, Neo4j, Titan, GraphX, Pig, Hue
7. Tools and visualizations
   Focus: Features for big data visualization and data utilization
   Components: Pentaho, Tableau, Qlik, Saiku, Mondrian, Gephi
8. Streaming data processing
   Focus: Efficiency and effectiveness of real-time data processing
   Components: Spark, Storm, Kafka, Avro
Data Warehousing Methodology - Implementing a data warehouse systematically
Data Warehouse Development Methods
Data warehouse development approaches:
- Kimball Model: data mart approach (data marts -> EDW)
- Inmon Model: EDW approach (EDW -> data marts)
Which model is better?
- There is no one-size-fits-all strategy for data warehousing.
- One alternative is the hosted warehouse.
Comparison
Kimball Model
Kimball's model follows a bottom-up approach. The data warehouse (DW) is provisioned from data marts (DM) as and when they are available or required. The data marts are sourced from OLTP systems, which are usually relational databases in third normal form (3NF). The data warehouse, which is central to the model, is a denormalized star schema. The OLAP cubes are built on this DW.
Inmon Model
Inmon's model follows a top-down approach. The data warehouse (DW) is sourced from OLTP systems and is the central repository of data. The data warehouse in Inmon's model is in third normal form (3NF). The data marts (DM) are provisioned out of the data warehouse as and when required. Data marts in Inmon's model are in 3NF, and the OLAP cubes are built from them.
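To make the term "denormalized star schema" concrete, here is a minimal sketch using Python's built-in sqlite3. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and sample rows are hypothetical and chosen only for illustration. In a 3NF design, the product category would typically live in its own table; in the star schema below it is folded into the product dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one fact table surrounded by denormalized dimension tables.
conn.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT      -- category folded into the dimension
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    quarter  INTEGER
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL
);
""")

conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Computers"), (2, "Mouse", "Accessories")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20150101, 2015, 1), (20150401, 2015, 2)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 20150101, 1000.0), (2, 20150101, 50.0), (1, 20150401, 1200.0)],
)

def sales_by_category_and_quarter(connection):
    """Aggregate the fact table along two dimensions, as an OLAP cube would."""
    return connection.execute("""
        SELECT p.category_name, d.year, d.quarter, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY p.category_name, d.year, d.quarter
        ORDER BY p.category_name, d.year, d.quarter
    """).fetchall()

result = sales_by_category_and_quarter(conn)
for row in result:
    print(row)
```

Every analytical query joins the fact table to each dimension exactly once, which is the property that makes star schemas convenient for building cubes.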
Strengths and Weaknesses
Scalable vs. structural
Kimball's model is more scalable because of its bottom-up approach: you can start small and scale up over time. The ROI is usually faster with Kimball's model. However, because of this approach, it is difficult to create reusable structures and ETL for different data marts. Inmon's model, on the other hand, is more structured and easier to maintain, but it is rigid and takes more time to build. A significant advantage of Inmon's model is that, because the DW is in 3NF, it is easier to build data mining models.
Both the Kimball and Inmon models agree and emphasize that the DW is the central repository of data and that OLAP cubes are built on denormalized star schemas.
In conclusion, when it comes to data modeling, it is irrelevant which camp you belong to as long as you understand why you are adopting a specific model. Sometimes it makes sense to take a hybrid approach.
General Data Warehouse Development Approaches
The most challenging aspect of data warehousing lies not in its technical difficulty, but in choosing the best approach to data warehousing for your company's structure and culture, and in dealing with the organizational and political issues that will inevitably arise during implementation. Among the different approaches to developing a data warehouse are:
- “Big bang” approach
- Incremental approach
  - Top-down incremental approach
  - Bottom-up incremental approach
ISQS 6339, Data Mgmt & BI, Zhangxi Lin
“Big Bang” Approach
- Analyze enterprise requirements
- Build enterprise data warehouse
- Report in subsets or store in data marts
Historically, IT departments attempted to provide enterprise-wide data warehouse implementations in a single project. Data warehouse development is a huge task, and it is a mistake to assume that the solution can be built all at once. The time required to develop the warehouse often means that user requirements and technologies change before the project is completed.
In this approach, you perform the following:
- Analyze the entire information requirement for the organization
- Build the enterprise data warehouse to support these requirements
- Build access, as required, either directly or by subsetting to data marts
Incremental Approach to Warehouse Development
- Multiple iterations
- Shorter implementations
- Validation of each phase
[Figure: Increment 1 moves through the phases Strategy, Definition, Analysis, Design, Build, and Production, iteratively.]
The incremental approach manages the growth of the data warehouse by developing incremental solutions that comply with the full-scale data warehouse architecture. Rather than building an entire enterprise-wide data warehouse as a first deliverable, start with just one or two subject areas, implement them as scalable data marts, and roll them out to your end users. Then, after observing how users are actually using the warehouse, add the next subject area or the next increment of functionality to the system. This is also an iterative process, and it is this iteration that keeps the data warehouse in line with the needs of the organization.
Benefits:
- Delivers a strategic data warehouse solution through incremental development efforts
- Provides an extensible, scalable architecture
- Supports the information needs of the enterprise organization
- Quickly provides business benefit and ensures a much earlier return on investment
- Allows a data warehouse to be built one subject or application area at a time
- Allows the construction of an integrated data mart environment
Top-Down Approach
- Analyze requirements at the enterprise level
- Develop conceptual information model
- Identify and prioritize subject areas
- Complete a model of selected subject area
- Map to available data
- Perform a source system analysis
- Implement base technical architecture
- Establish metadata, extraction, and load processes for the initial subject area
- Create and populate the initial subject area data mart within the overall warehouse framework
Advantages of the top-down incremental approach:
- Provides a relatively quick implementation and payback. Typically, the scoping, definition study, and initial implementation are scaled down so that they can be completed in six to seven months.
- Offers significantly lower risk because it avoids being as analysis-heavy as the “big bang” approach.
- Emphasizes high-level business needs.
- Achieves synergy among subject areas. Maximum information leverage is achieved as cross-functional reporting and a single version of the truth are made possible.
Disadvantages of the top-down incremental approach:
- Requires an increase in up-front costs before the business sees any return on its investment.
- Makes it difficult to define the boundaries of the scoping exercise if the business is global.
- May not be suitable unless the client needs cross-functional reporting.
Bottom-Up Approach
- Define the scope and coverage of the data warehouse and analyze the source systems within this scope
- Define the initial increment based on political pressure, assumed business benefit, and data volume
- Implement base technical architecture and establish metadata, extraction, and load processes as required by the increment
- Create and populate the initial subject areas within the overall warehouse framework
The bottom-up incremental approach is similar to the top-down approach, but the emphasis is on the data rather than the business benefit. Here, IT is in charge of the project, either because IT wants to be in charge or because the business has deferred the project to IT.
Advantages of the bottom-up incremental approach:
- This is a “proof of concept” type of approach, so it is often appealing to IT.
- It is easier to get IT buy-in for this approach because it is focused on IT.
Disadvantages of the bottom-up incremental approach:
- Because the solution model is typically developed from source systems, and these source systems encapsulate the current business processes, the overall extensibility of the model is compromised.
- IT staff are often the last to know about business changes, so IT could be designing something that will be out of date before its delivery is complete.
- Because the framework of definition in this approach tends to be much narrower, a significant amount of reengineering work is often required for each increment.