
Presentation transcript:

Census Bureau DRIS Date: 01/16/2007

2 Index
- Data Modeling
- Current Datafile
- Current Dataload
- Data Overlook
- Two Approaches
- First Approach
- Data Distribution
- Advantages
- Disadvantages

3 Index (continued)
- Second Approach
- Basic Modeling
- Advantages
- Advance Work
- Care needed
- Our Recommendation
- Tasks

4 Data modeling
- Conversion of data from Legacy (Fortran) to RDBMS (Oracle)
- Hardware/software
  - Sun V890/E12K, OS Solaris 5.7, 5.8, 5.9, 5.10
  - Database: Oracle 10g
  - Oracle Designer / ERwin

5 Current datafile
[Diagram. Labels: Big datafile; Geo Census Base Data; Legacy process; Data modeling; Oracle db; Reports; Data Feeds; Data updates; PL/SQL, Shell, C, ETL tool]

6 Current Dataload
- UCNM data
- Fortran format
- One big file with 180 M records
- Record length is 1543 bytes
- Most of the fields are varchar2
- Many fields are blank/no data
- Performance is too poor in Oracle

7 Data overlook (approx.)

State                  Records   Size
State of NY            20 M      31 GB
State of CA            34 M      52 GB
State of TX            25 M      38 GB
District of Columbia   500 K     750 MB
Delaware               1 M       1.5 GB
Connecticut            (not given)
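The sizes quoted in this deck are consistent with multiplying a record count by the 1543-byte record length from the "Current Dataload" slide. The short sketch below reproduces that arithmetic. The function name and the use of decimal gigabytes (10**9 bytes) are assumptions made for illustration; the slides do not say how the sizes were computed.

```python
# Record length stated on the "Current Dataload" slide.
RECORD_LENGTH_BYTES = 1543


def estimated_size_gb(record_count: int,
                      record_length: int = RECORD_LENGTH_BYTES) -> float:
    """Return the raw data size in decimal gigabytes (10**9 bytes).

    Assumption: size = record count x fixed record length, with no
    allowance for database overhead, indexes, or compression.
    """
    return record_count * record_length / 1e9


if __name__ == "__main__":
    # Record counts taken from the slides: the full file and the
    # per-state examples.
    for records in (180_000_000, 20_000_000, 34_000_000,
                    25_000_000, 1_000_000, 500_000):
        print(f"{records:>11,} records: {estimated_size_gb(records):8.2f} GB")
```

Under these assumptions the full 180 M record file comes to roughly 278 GB of raw data, which gives a sense of why a single-table load performs poorly.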

8 Two approaches
- First Approach: break the datafile on the basis of data
  - E.g. RO level (12)
  - State level (54-56), including DC, Puerto Rico etc.
- Second Approach: break the datafile into multiple tables, with changed field definitions, using the relational model

9 First approach
Break datafile on the basis of data
[Diagram: Current datafile split into Table_CA, Table_NY, Table_XX, Table_YY, Table_54]
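A minimal sketch of the first approach follows, using Python's built-in sqlite3 module as a stand-in for Oracle so that it runs anywhere. The source table name `current_datafile` and the columns `mafid`, `state`, and `street` are hypothetical; the real file has many more fields. It creates one table per distinct state value and copies the matching rows into it.

```python
import sqlite3
from typing import List


def split_by_state(conn: sqlite3.Connection,
                   source: str = "current_datafile") -> List[str]:
    """Create one table per state and copy that state's rows into it.

    `source` and the column names are hypothetical placeholders.
    Table names are built from data, so each state code is validated
    before it is interpolated into SQL.
    """
    states = [row[0] for row in conn.execute(
        f"SELECT DISTINCT state FROM {source} ORDER BY state")]
    created = []
    for state in states:
        if not state.isalnum():
            raise ValueError(f"unexpected state code: {state!r}")
        table = f"Table_{state}"
        conn.execute(
            f"CREATE TABLE {table} "
            "(mafid INTEGER PRIMARY KEY, state TEXT, street TEXT)")
        conn.execute(
            f"INSERT INTO {table} (mafid, state, street) "
            f"SELECT mafid, state, street FROM {source} WHERE state = ?",
            (state,))
        created.append(table)
    return created
```

Every query that spans more than one state now has to name several tables, which is the main cost of splitting this way.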

10 Data distribution
- Uneven data distribution
- Big data tables will be 30+ GB
- Small data tables will be under 1 GB

11 Advantages
- State level queries will be faster than they are now
- If the data is separated by RO, it will be more evenly distributed, with fewer tables (about 12 instead of 54-56)

12 Disadvantages
- Too many tables
- Many fields are empty and varchar2(100)
- No normalization technique is used
- Existing queries need to be changed a lot
- For small tables, queries will run fast, but for big tables there will be a lot of overhead
- The number of operational tables will stay the same
- Queries become too complicated to run, and joining main and operational tables may confuse users

13 Second approach
Break datafile into a few relational tables with changes in field definitions
[Diagram: Current datafile split into Table1, Table2, Table3, Table4, linked by MAFID]

14 Basic Modeling
- Database design, logical and physical
- Relations will be defined based on a primary key
- In this case, it will be MAFID, which is unique
- varchar2(100) fields will be converted to smaller fields, say varchar2(60) or smaller, based on actual field lengths
- Every field will be mapped to at least one field in the new tables
- Data will be inserted into multiple small tables
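The modeling steps above can be sketched as a small schema keyed on MAFID. This is an illustration only: it uses Python's built-in sqlite3 module rather than Oracle, and the table names (`address`, `address_extra`) and columns (`state`, `street`, `unit_note`) are hypothetical. The idea shown is that mostly blank fields move to a side table that gets a row only when a value is actually present.

```python
import sqlite3

# Hypothetical two-table layout; the real design would have more tables
# and Oracle types such as VARCHAR2 sized from actual field lengths.
SCHEMA = """
CREATE TABLE address (
    mafid  INTEGER PRIMARY KEY,
    state  TEXT NOT NULL,
    street TEXT NOT NULL
);
CREATE TABLE address_extra (
    mafid     INTEGER PRIMARY KEY REFERENCES address(mafid),
    unit_note TEXT NOT NULL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical main and side tables."""
    conn.executescript(SCHEMA)


def load_record(conn: sqlite3.Connection, record: dict) -> None:
    """Insert one legacy record, skipping the side table when blank."""
    conn.execute(
        "INSERT INTO address (mafid, state, street) VALUES (?, ?, ?)",
        (record["mafid"], record["state"], record["street"]))
    note = (record.get("unit_note") or "").strip()
    if note:
        conn.execute(
            "INSERT INTO address_extra (mafid, unit_note) VALUES (?, ?)",
            (record["mafid"], note))
```

A query that needs the optional field joins the two tables on `mafid`; queries that do not need it read only the narrower main table.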

15 Advantages
- Faster
  - Queries
  - Updates
  - Deletes
  - Additions
- Less maintenance
- Same approach can be used for transactional/operational data

16 Advance work
- Identify each and every field of UNM data
- Check/define the length of each field
- Map every field to a field in the new tables
- Can some fields be merged together? If yes, identify those
- Define tables and relationships
- Break up the data and load it into these tables
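Checking field lengths, as listed above, can be done by profiling the legacy records before any table is defined. The sketch below assumes fixed-width text records and a hypothetical layout of (name, start, end) offsets; neither the layout nor the function name comes from the slides. For each field it reports the longest non-blank value and the fraction of records in which the field is blank.

```python
from typing import Dict, Iterable, List, Tuple

# (field name, start offset, end offset); a hypothetical layout format.
FieldLayout = List[Tuple[str, int, int]]


def profile_fields(records: Iterable[str],
                   layout: FieldLayout) -> Dict[str, Dict[str, float]]:
    """Return max observed length and blank fraction for each field."""
    max_len = {name: 0 for name, _, _ in layout}
    blanks = {name: 0 for name, _, _ in layout}
    total = 0
    for record in records:
        total += 1
        for name, start, end in layout:
            value = record[start:end].strip()
            if value:
                max_len[name] = max(max_len[name], len(value))
            else:
                blanks[name] += 1
    return {
        name: {
            "max_len": max_len[name],
            "blank_fraction": blanks[name] / total if total else 0.0,
        }
        for name, _, _ in layout
    }
```

A field whose maximum observed length is well under 100 is a candidate for a smaller column, and a field with a high blank fraction is a candidate for a separate table.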

17 Care needed
- The current datafile will be broken into multiple datafiles for data processing
- Load the datafiles into the tables one at a time
- Make sure that all datafiles are loaded into the tables
- Verify that no data from the base table is missing
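One way to confirm that no data from the base table is missing is to compare key sets after each load. The sketch below again uses sqlite3 as a stand-in for Oracle and assumes both tables expose a `mafid` column; the function name and table arguments are hypothetical, and the table names must be trusted identifiers because they are interpolated into the SQL text.

```python
import sqlite3
from typing import List


def missing_mafids(conn: sqlite3.Connection,
                   base_table: str, new_table: str) -> List[int]:
    """Return MAFIDs present in base_table but absent from new_table.

    Table names are interpolated directly, so pass only trusted,
    hard-coded identifiers.
    """
    rows = conn.execute(
        f"SELECT mafid FROM {base_table} "
        f"EXCEPT SELECT mafid FROM {new_table}")
    return sorted(row[0] for row in rows)
```

An empty result means every key in the base table reached the new table; it does not by itself prove that the non-key fields were copied correctly.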

18 Our Recommendation
** Second Approach **
Why?
- Data distribution will be uniform
- Rarely needed data is moved to separate tables
- This will reduce overhead on queries and updates
- Existing queries can be reused with minor modifications
- Less maintenance
- Additional data, such as RPS data, can be easily loaded using the same queries

19 Tasks
- Design the database using a data modeling tool (Oracle Designer, ERwin etc.)
- Create test data from the original datafile
- Load test data into database tables
- Create test scripts to check data consistency
- Check indexes for required queries
- Test old data vs. new data

20 Continued…
- Break data into small files
- Load full data into tables
- Unit test the data for consistency
- Run queries on the database
- If needed, fine-tune the database
- Use the same approach for transactional data like RPS data
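Breaking the data into small files can be sketched as follows. The code assumes the legacy file is a sequence of fixed-length records with no line terminators, which the slides do not confirm; the function name, output file naming, and parameters are all hypothetical. It cuts the file only on record boundaries so that no record is split across two parts.

```python
from pathlib import Path
from typing import List


def split_fixed_records(source: Path, out_dir: Path,
                        record_length: int,
                        records_per_file: int) -> List[Path]:
    """Split a fixed-length-record file into smaller part files.

    Assumes records are exactly `record_length` bytes with no newline
    terminators; raises ValueError if the file ends mid-record.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_bytes = record_length * records_per_file
    parts: List[Path] = []
    with source.open("rb") as src:
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            if len(chunk) % record_length != 0:
                raise ValueError(
                    "file size is not a multiple of the record length")
            part = out_dir / f"part_{len(parts):05d}.dat"
            part.write_bytes(chunk)
            parts.append(part)
    return parts
```

For the file described in this deck, `record_length` would be 1543; the number of records per part is a tuning choice that depends on the loader being used.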

21 THE END