Census Bureau DRIS Date: 01/16/2007

2 Index
- Data Modeling
- Current Datafile
- Current Dataload
- Data Overlook
- Two Approaches
- First Approach
- Data Distribution
- Advantages
- Disadvantages

3 Index (continued)
- Second Approach
- Basic Modeling
- Advantages
- Advance Work
- Care needed
- Our Recommendation
- Tasks

4 Data modeling
- Conversion of data from Legacy (Fortran) to RDBMS (Oracle)
- Hardware/software
- Sun V890/E12K, OS Solaris 5.7, 5.8, 5.9, 5.10
- Database - Oracle 10g
- Oracle Designer / ERwin

5 Current datafile
[Diagram labels: Big datafile; Geo; Census; Base Data; Legacy process; Data modeling; Oracle db; Reports; Data Feeds; Data updates; PL/SQL, Shell, C, ETL tool]

6 Current Dataload
- UCNM data
- Fortran format
- One big file w/ 180 M records
- Record length is 1543 bytes
- Most of the fields are varchar2
- Many fields are blank/no data
- Performance too poor in Oracle

7 Data overlook (approx)
State                  Records   Size
State of NY            20 M      31 G
State of CA            34 M      52 G
State of TX            25 M      38 G
District of Columbia   500 K     750 M
Delaware               1 M       1.5 G
Connecticut
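
The sizes on this slide appear to follow from record count times the 1543-byte record length given on the Current Dataload slide. The sketch below checks that arithmetic. It assumes the figures pair with the states in the order listed and that "G" means decimal gigabytes; the function name is illustrative, and database overhead is ignored.

```python
# Rough size check: records x record length, ignoring any database overhead.
RECORD_LENGTH_BYTES = 1543  # record length from the "Current Dataload" slide


def estimated_size_gb(record_count: int, record_length: int = RECORD_LENGTH_BYTES) -> float:
    """Return the raw data size in decimal gigabytes (10**9 bytes)."""
    return record_count * record_length / 1e9


# Approximate record counts, assuming they pair with the states in slide order.
record_counts = {
    "NY": 20_000_000,
    "CA": 34_000_000,
    "TX": 25_000_000,
    "DC": 500_000,
    "Delaware": 1_000_000,
}

for name, count in record_counts.items():
    print(f"{name}: {estimated_size_gb(count):.2f} GB")

# The full file: 180 M records of 1543 bytes each.
print(f"Full file: {estimated_size_gb(180_000_000):.2f} GB")
```

Under these assumptions the full 180 M record file comes to roughly 278 GB of raw data.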

8 Two approaches
- First Approach: break datafile on the basis of data
- E.g. RO level (12)
- State level (54-56), including DC, Puerto Rico etc.
- Second Approach: break datafile into multiple tables with change in field definitions using relational model

9 First approach
Break datafile on the basis of data
[Diagram: Current datafile split into Table_CA, Table_NY, Table_XX, Table_YY, Table_54]
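
A minimal sketch of the first approach, assuming each record carries a two-character state code at the start of the line. The slides do not give the actual Fortran record layout, so `STATE_FIELD` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical layout: assume the first two characters hold the state code.
STATE_FIELD = slice(0, 2)


def split_by_state(records: Iterable[str], state_field: slice = STATE_FIELD) -> Dict[str, List[str]]:
    """Group fixed-width records by state code, one group per target table."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        groups[record[state_field]].append(record)
    return dict(groups)


# Made-up sample records: state code followed by an identifier.
sample = ["NY000000001", "CA000000002", "NY000000003", "DC000000004"]
groups = split_by_state(sample)

# Each group would be loaded into its own table, e.g. Table_NY.
for state, rows in sorted(groups.items()):
    print(f"Table_{state}: {len(rows)} record(s)")
```

For the real 180 M record file the groups would be streamed to separate files rather than held in memory.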

10 Data distribution
- Uneven data distribution
- Big data tables will be 30+ G
- Small data tables will be under about 1 G

11 Advantages
- State level queries will be faster than current
- If the data is separated by RO, the data will be more evenly distributed w/ fewer tables (close to 12 instead of 54-56)

12 Disadvantages
- Too many tables
- Many fields are empty and varchar2(100)
- No normalization technique is used
- Existing queries need to be changed a lot
- For small tables, query will run fast but for big tables, there will be a lot of overhead
- Operational tables will be same in number
- Too complicated to run queries, may confuse users while joining main and operational tables
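
To illustrate why existing queries would need to change a lot, the sketch below shows a nationwide lookup under the first approach: one SELECT per state table, joined with UNION ALL. It uses Python's built-in SQLite as a stand-in for Oracle. The `city` column and the sample rows are hypothetical; the `Table_XX` naming follows the First approach slide.

```python
import sqlite3
from typing import List


def nationwide_query(states: List[str]) -> str:
    """Build one SELECT per state table and combine them with UNION ALL."""
    selects = [f"SELECT mafid, city FROM Table_{s} WHERE city = ?" for s in states]
    return " UNION ALL ".join(selects)


# SQLite in memory stands in for the Oracle database.
conn = sqlite3.connect(":memory:")
states = ["CA", "NY", "DC"]
for s in states:
    conn.execute(f"CREATE TABLE Table_{s} (mafid INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO Table_CA VALUES (1, 'Springfield')")
conn.execute("INSERT INTO Table_NY VALUES (2, 'Springfield')")
conn.execute("INSERT INTO Table_DC VALUES (3, 'Washington')")

sql = nationwide_query(states)
# One bind value is needed per state table.
rows = sorted(conn.execute(sql, ["Springfield"] * len(states)).fetchall())
print(rows)
```

With 54-56 state tables, the same lookup would contain 54-56 SELECT branches.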

13 Second approach
Break datafile into few relational tables with change in field definitions
[Diagram: Current datafile split into Table1, Table2, Table3, Table4, related by MAFID]

14 Basic Modeling
- Database design/logical and physical
- Relations will be defined based on a primary key
- In this case, it will be MAFID, which is unique
- varchar2(100) fields will be converted to smaller fields, say varchar2(60) or smaller/based on actual field lengths
- All fields will be mapped with at least one of the fields in the new tables
- Data will be inserted in small multiple tables
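
A minimal sketch of this kind of model, using Python's built-in SQLite as a stand-in for Oracle. The table and column names (`address`, `address_note`, `street`, `state_code`, `note`) are hypothetical; only MAFID as the primary key comes from the slides. SQLite does not enforce declared varchar lengths, so the length limits are written as CHECK constraints here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE address (
    mafid      INTEGER PRIMARY KEY,
    street     TEXT NOT NULL CHECK (length(street) <= 60),
    state_code TEXT NOT NULL CHECK (length(state_code) = 2)
);
-- Sparse field moved to its own table, related by MAFID.
CREATE TABLE address_note (
    mafid INTEGER PRIMARY KEY REFERENCES address(mafid),
    note  TEXT NOT NULL CHECK (length(note) <= 60)
);
""")


def load_record(conn: sqlite3.Connection, mafid: int, street: str, state_code: str, note: str) -> None:
    """Insert the core row; store the sparse field only when it has data."""
    conn.execute("INSERT INTO address VALUES (?, ?, ?)", (mafid, street.strip(), state_code))
    if note.strip():
        conn.execute("INSERT INTO address_note VALUES (?, ?)", (mafid, note.strip()))


load_record(conn, 1, "1 MAIN ST", "NY", "")
load_record(conn, 2, "2 OAK AVE", "CA", "gated entrance")

# Reassemble the original wide record with a join on MAFID.
rows = conn.execute(
    "SELECT a.mafid, a.street, n.note FROM address a "
    "LEFT JOIN address_note n ON n.mafid = a.mafid ORDER BY a.mafid"
).fetchall()
print(rows)
```

Blank fields take no space in the side table, and the join on MAFID recovers the original record when it is needed.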

15 Advantages
- Faster queries, updates, deletes, additions
- Less maintenance
- Same approach can be used for transactional/operational data

16 Advance work
- Identify each and every field of UNM data
- Check/Define field lengths of each field
- Map every field to new table field
- Can some fields be merged together? If yes, identify those
- Define tables and relationships
- Break and load data into these tables
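
The field-length check could be sketched as a simple profiling pass over the fixed-width records. The layout below (field names and byte offsets) is hypothetical, since the slides do not list the actual fields.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical layout: field name -> (start, end) offsets within a record.
LAYOUT: Dict[str, Tuple[int, int]] = {
    "mafid": (0, 9),
    "street": (9, 29),
    "note": (29, 39),
}


def profile_fields(records: Iterable[str], layout: Dict[str, Tuple[int, int]]) -> Dict[str, Dict[str, int]]:
    """Report the longest trimmed value and the blank count for each field."""
    stats = {name: {"max_len": 0, "blank": 0} for name in layout}
    for record in records:
        for name, (start, end) in layout.items():
            value = record[start:end].strip()
            if value:
                stats[name]["max_len"] = max(stats[name]["max_len"], len(value))
            else:
                stats[name]["blank"] += 1
    return stats


# Made-up fixed-width sample records padded to the layout widths.
sample = [
    "000000001" + "1 MAIN ST".ljust(20) + "".ljust(10),
    "000000002" + "22 LONG OAK AVE".ljust(20) + "GATED".ljust(10),
]
stats = profile_fields(sample, LAYOUT)
for name, s in stats.items():
    print(f"{name}: longest value {s['max_len']} chars, {s['blank']} blank")
```

The longest observed value suggests how far a varchar2(100) field can be shrunk, and a high blank count marks a field as a candidate for a separate table.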

17 Care needed
- Current datafile will be broken into multiple datafiles for data processing
- Load the datafiles into tables one by one
- Make sure that all datafiles are loaded into the multiple tables
- Make sure that no data from the base table is missing
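
One way to guard against missing data is to reconcile record counts after a chunked load. The sketch below uses Python's built-in SQLite as a stand-in for Oracle; the `base` table, the simplified two-column record, and the chunk size are all hypothetical.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # (mafid, payload): a hypothetical, simplified record


def chunked(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most chunk_size records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def load_in_chunks(conn: sqlite3.Connection, records: Iterable[Record], chunk_size: int) -> int:
    """Load records chunk by chunk, committing after each chunk."""
    loaded = 0
    for chunk in chunked(records, chunk_size):
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO base VALUES (?, ?)", chunk)
        loaded += len(chunk)
    return loaded


source = [(i, f"record-{i}") for i in range(1, 11)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (mafid INTEGER PRIMARY KEY, payload TEXT)")

loaded = load_in_chunks(conn, source, chunk_size=4)
in_table = conn.execute("SELECT COUNT(*) FROM base").fetchone()[0]

# Reconcile: source count, loaded count, and table count must all agree.
assert loaded == in_table == len(source), "record counts do not reconcile"
print(f"{in_table} of {len(source)} records loaded")
```

The primary key on MAFID also rejects a chunk that is accidentally loaded twice.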

18 Our Recommendation
** Second Approach **
Why?
- Data distribution will be uniform
- Less-wanted data is moved to separate tables
- This will reduce the overhead of queries and updates
- Existing queries can be reused with minor modifications
- Less maintenance
- Additional data, e.g. from RPS, can be easily loaded using the same queries

19 Tasks
- Design database using data modeling tool / Oracle Designer / ERwin etc.
- Create test data from original datafile
- Load test data into database tables
- Create test scripts to check data consistency
- Check indexes for required queries
- Test old data vs. new data
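
A consistency test script for old data vs. new data could start from a comparison keyed on MAFID, as sketched below. It assumes both sides have already been read into dictionaries mapping MAFID to the record's field values; the function name and sample values are hypothetical.

```python
from typing import Dict, List, Tuple

Fields = Tuple[str, ...]


def compare_by_mafid(old: Dict[int, Fields], new: Dict[int, Fields]) -> Dict[str, List[int]]:
    """List MAFIDs that are missing, unexpected, or changed in the new data."""
    missing = sorted(old.keys() - new.keys())
    extra = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"missing": missing, "extra": extra, "changed": changed}


# Made-up sample: record 2 was altered and record 3 was lost during the load.
old_data = {1: ("1 MAIN ST", "NY"), 2: ("2 OAK AVE", "CA"), 3: ("3 ELM RD", "TX")}
new_data = {1: ("1 MAIN ST", "NY"), 2: ("2 OAK AVENUE", "CA")}

report = compare_by_mafid(old_data, new_data)
print(report)
```

An empty report on all three lists means the reassembled records match the original datafile for the sample tested.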

20 Continued…
- Break data into small files
- Load full data into tables
- Unit test on data for consistency
- Run queries on the database
- If needed, fine tune database
- Use same approach for transactional data like RPS data

21 THE END
