A Universal File Flattener

David A. Vandenbroucke
U.S. Department of Housing & Urban Development

Good afternoon. I am Dav Vandenbroucke from the U.S. Department of Housing & Urban Development. I’m here to tell you about a program I wrote that takes any set of relational SAS tables and transforms them into a single flat table, with a minimum of user intervention. All you need to tell the program is where to find the tables, what the key fields are, and where to put the results.

I appreciate the fact that I’m all that stands between you and the cocktail party. I hope you find the presentation worthwhile. I do need to mention that I am speaking for myself today, and nothing should be considered official policy of the United States government.

Copyright © 2010, SAS Institute Inc. All rights reserved.

About the presenter

Dav Vandenbroucke started using SAS in 1977, submitting jobs on punch cards. For the past twenty-something years he has been a designer, analyst, and user-supporter of the American Housing Survey, a longitudinal survey with a 30+ year panel. He has pioneered methods of measuring housing affordability and linking surveys to trace changes in the housing stock. He lives in Alexandria, Virginia and has an 18-year-old car with 26,000 miles on it.

#SASGF Copyright © 2016, SAS Institute Inc. All rights reserved.

Key Points

Turn a set of relational tables into a single table
Automatically enumerate which tables to include
Automatically detect one-to-many relationships
Use two passes of PROC TRANSPOSE to replicate sets of fields
Create Excel report

This program uses PROC DATASETS to get a list of tables in the specified input library, so the user does not have to supply that list. It puts the list into a macro variable, which it treats as a horizontal array so that it can loop through each table to see whether it has a one-to-many relationship with the key values. It then uses two passes of PROC TRANSPOSE to restructure each such table into a one-to-one table, with as many replicate sets of fields as are needed to accommodate the “many” part of the relationship. The first pass essentially disassembles the table into its smallest elements, and the second pass reassembles those elements into a flattened table. The program then merges all of the tables into a single output table. At the end, it uses ODS to send a report to a spreadsheet, so that you can see what happened.

#SASGF13 Copyright © 2013, SAS Institute Inc. All rights reserved.

Relational Database

Housing
Street  Tenure  Type
Maple   Own     House
Larch   Rent    Apt
Cedar           Mobile

Person (one-to-many relationship with Housing)
Street  FNam   Sex
Maple   Moe    M
Maple   Mae    F
Larch   Larry
Larch   Leila
Larch   Louie
Cedar   Curly

A relational database is a set of tables that can be linked, or related, by some combination of key fields. In this simple example, we have two tables, Housing and Person. Housing has three rows, each representing a housing unit. The key field is the street name: Maple, Larch, and Cedar. There are two data fields. Tenure shows whether the unit is owned or rented, while Type shows the structure of the unit: house, apartment, or mobile home.

The Person table shows the people who live in the housing units. It contains two fields, for the person’s first name and for his or her gender. It has more than one row for some of the units, because several people live there. It can be related to the Housing table by means of the street name.

In this example, Street is the only key field. However, some databases use a combination of fields to denote a unique entity. For example, a housing unit might require a street number and an apartment number, which could be stored in separate fields.
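As a minimal sketch, the two example tables could be built and related in SAS as follows. The Sex values for everyone except Moe and Mae, and the Tenure value for Cedar, are assumed for illustration because the slide text does not show them.

```sas
/* Housing: one row per key value (Street).                  */
/* Cedar's Tenure is an assumed value for illustration only. */
data Housing;
  input Street $ Tenure $ Type $;
  datalines;
Maple Own House
Larch Rent Apt
Cedar Own Mobile
;

/* Person: one or more rows per Street (one-to-many).        */
/* Sex values other than Moe's and Mae's are assumed.        */
data Person;
  input Street $ FNam $ Sex $;
  datalines;
Maple Moe M
Maple Mae F
Larch Larry M
Larch Leila F
Larch Louie M
Cedar Curly M
;

/* Relate the two tables through the key field. */
proc sql;
  select h.Street, h.Tenure, h.Type, p.FNam, p.Sex
    from Housing as h
         inner join Person as p
         on h.Street = p.Street;
quit;
```

The join repeats the Housing fields on every Person row. That is the opposite of a flat file, which keeps one row per key value and spreads the persons across columns.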

Flattened File

Street  Tenure  Type    FNam1  Sex1  FNam2  Sex2  FNam3  Sex3
Maple   Own     House   Moe    M     Mae    F     .
Larch   Rent    Apt     Larry        Leila        Louie
Cedar           Mobile  Curly

A flat version of this database looks like this. There is one row for each value of the key field. The fields on the left are from the Housing table. To the right are replicate sets of fields, representing the data from the Person table. There are three such sets, in order to accommodate the Larch address, which has three persons. Some of the fields in the Maple and Cedar rows are filled with missing values, because neither of them has as many persons as Larch. The file flattener program identifies those tables that have one-to-many relationships with the key fields and then constructs enough sets of replicate fields to accommodate all of the data.

American Housing Survey (partial list of tables)

Name  Description  Obs  Fields  One to...
HOUSE  Housing units  84,355  760  1
PERSON  Demographic data  148,342  82  20
HOMIMP  Home Improvement  46,641  6  34
OMOV  Out-movers topical  2,351  8  3
OWNER  Resident landlords  27,570
REPWGT  Replicate weights  484
RMOV  Recent movers  17,678  23
TOPICAL  Unit-level topical  114

You might wonder why anyone would want an automated program to flatten a file. That example doesn’t look very difficult. Here is a more complex one. I help run the American Housing Survey. This is a partial list of the tables in our 2013 database. There are two more tables, but they didn’t fit on the slide. This database has many tables, observations, and fields. Some of the tables are one-to-one, but some are one-to-many. While this is complicated in itself, none of these characteristics are static. The 2015 database, when we release it, will have different tables, observations, fields, and relationships. I wrote the file flattener so that we could produce a flat file with each year’s data without having to customize the program for all of these changes. Instead, the program gets the information it needs from the database itself.

User Configuration

%LET InName = C:\here; /* input directory */
%LET OutName = C:\there; /* output directory */
%LET SpreadDir = C:\there; /* report directory */
%LET DSet = Flatfile1; /* name of flat file */
%LET Keys = field1 field2; /* list of key fields */

The user needs to give the program five pieces of information by assigning values to these macro variables. This is all you need to do before running the program.

InName is the path to the library where the relational database is stored. It should not be the SAS work directory.
OutName is the path to the library where the flat file will be stored. It may be the same as the input library, and it may be the work library.
SpreadDir is the path where the report spreadsheet will be saved. It can be anything you want.
DSet is the name of the flat file. It will also be used to name the spreadsheet.
Keys is the list of fields that will be used to link the relational files. It should be a space-delimited list. It can be any length, and it can mix numeric and character fields, in any order.

As I mentioned in my example, you sometimes need a combination of several fields to identify a unique entity in your database. All you have to do is list them here. The program takes care of the rest.
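One way these settings might be consumed is sketched below. The librefs InLib and OutLib and the .xlsx extension are assumptions made for illustration; the presentation does not say how the program itself binds the paths.

```sas
/* Placeholder paths from the slide; replace with real directories. */
%let InName    = C:\here;
%let OutName   = C:\there;
%let SpreadDir = C:\there;
%let DSet      = Flatfile1;

/* Hypothetical librefs bound to the configured paths.           */
/* ACCESS=READONLY guards the source tables against overwriting. */
libname InLib  "&InName." access=readonly;
libname OutLib "&OutName.";

/* Show where the flat file and its report would be written.     */
/* The .xlsx extension is an assumption.                         */
%put NOTE: Flat file = OutLib.&DSet.;
%put NOTE: Report    = &SpreadDir.\&DSet..xlsx;
```

The doubled period in `&DSet..xlsx` is deliberate: the first period ends the macro variable reference and the second is the literal dot in the file name.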

Detecting and Processing Tables

PROC DATASETS to list tables in a library
_NULL_ DATA step to put list in a macro variable (ex: &FileList. = HOUSE PERSON)
%DO %UNTIL loop to process each table in FileList
PROC SQL to count repetitions of key variables in a table (ex: &MaxCount. = 3)

Because I have only twenty minutes of your time, I am going to skip over some of the details of how the program works. They are explained in the paper. The program automatically enumerates the tables in the input library and then loops through all of them. For each one, it determines whether it has a one-to-one or one-to-many relationship with the key fields. It sets the one-to-one tables aside and processes the one-to-many tables, as we’ll see in a moment. Enumerating and evaluating the input tables uses the SAS resources shown on this slide.
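The four steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the author's code: WORK stands in for the input library, there is a single key (Street), the dataset _vars and the macro CountReps are hypothetical names, and keeping only tables that contain the key is a guard added for the sketch. Sample values not shown on the slides are assumed.

```sas
/* Sample tables in WORK, standing in for the input library. */
data Housing;
  input Street $ Tenure $ Type $;
  datalines;
Maple Own House
Larch Rent Apt
Cedar Own Mobile
;
data Person;
  input Street $ FNam $ Sex $;
  datalines;
Maple Moe M
Maple Mae F
Larch Larry M
Larch Leila F
Larch Louie M
Cedar Curly M
;

%let Keys = Street;   /* single key field in this sketch */

/* 1. PROC DATASETS writes one row per variable per table. */
proc datasets library=work nolist;
  contents data=_all_ out=work._vars(keep=memname name) noprint;
run;
quit;

/* 2. _NULL_ DATA step: space-delimited list of the tables */
/*    that contain the key field (a guard for this sketch). */
data _null_;
  length list $32767;
  retain list;
  set work._vars end=last;
  where upcase(name) = "%upcase(&Keys.)";
  list = catx(' ', list, memname);
  if last then call symputx('FileList', list);
run;

/* 3 and 4. Loop over the list; count rows per key value. */
%macro CountReps;
  %local i Table;
  %let i = 1;
  %do %until (%length(%scan(&FileList., &i.)) = 0);
    %let Table = %scan(&FileList., &i.);
    proc sql noprint;
      select max(n) into :MaxCount trimmed
        from (select count(*) as n
                from work.&Table.
                group by &Keys.);
    quit;
    %put NOTE: &Table. has at most &MaxCount. row(s) per key value.;
    %let i = %eval(&i. + 1);
  %end;
%mend CountReps;
%CountReps
```

A table whose maximum count is 1 is one-to-one with the key and can be set aside. With several key fields, the space-delimited Keys list would need to be converted to a comma-delimited one before it is used in GROUP BY.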

Two-Stage Flattening: Preliminaries

Add field, RecNo, to count replicates of key values with a DATA step
Character and numeric data processed separately
Table restructured with two uses of PROC TRANSPOSE
Disassemble table into one record per cell
Reassemble into flat file

If a table has more than one record for each set of key values, then it has to be flattened. This is where the main work of the program takes place. Before we start the transpositions, we add a variable, RecNo, to count the replicates of the key values in the table. We do this with a DATA step that processes the table BY the keys and resets the counter every time the key variables change.

A quirk of PROC TRANSPOSE is that if a table contains only numeric data, the data type is maintained, but if the table contains a mixture of types, the output table is all character. We don’t want to lose the numeric variables in the table, and so the flattening is applied to the numeric and character variables separately. We reassemble the pieces at the end of the process.

In order to flatten the table, it has to go through two passes of PROC TRANSPOSE. As we will see, the first one takes the table apart, while the second one reassembles it as a flat file.
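The RecNo counter described above can be sketched like this. The dataset names PersonSorted and PersonRec and the helper macro variable LastKey are hypothetical, and Sex values not shown on the slides are assumed.

```sas
data Person;
  input Street $ FNam $ Sex $;
  datalines;
Maple Moe M
Maple Mae F
Larch Larry M
Larch Leila F
Larch Louie M
Cedar Curly M
;

%let Keys    = Street;
/* The counter restarts whenever the last key in the list changes. */
%let LastKey = %scan(&Keys., -1);

/* BY-group processing needs the table in key order. */
proc sort data=Person out=PersonSorted;
  by &Keys.;
run;

data PersonRec;
  set PersonSorted;
  by &Keys.;
  if first.&LastKey. then RecNo = 0;  /* new key value: reset    */
  RecNo + 1;                          /* sum statement retains it */
run;
```

Testing FIRST. on the last key is enough even with several keys, because FIRST.variable is set for a key whenever any key listed before it in the BY statement changes.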

First Transposition (Person Table)

Input table:
Street  FNam   Sex  RecNo
Maple   Moe    M    1
Maple   Mae    F    2
Larch   Larry       1
Larch   Leila       2
Larch   Louie       3
Cedar   Curly       1

Output table (first rows):
Street  RecNo  _NAME_  COL1
Maple   1      FNam    Moe
Maple   1      Sex     M
Maple   2      FNam    Mae
Maple   2      Sex     F
Larch   1      FNam    Larry
Larch   1      Sex
Larch   2      FNam    Leila
...

PROC TRANSPOSE... BY &Keys. RecNo;

The first transposition uses the keys and the replicate number as BY variables. Rather than go through the syntax, which you can read in the paper, I’ll show you what the first transposition does. The input is the one-to-many table that we saw earlier, with the RecNo variable added. The output from PROC TRANSPOSE turns each data cell in the original table into a row. Thus, the first input row becomes three output rows: one each for FNam, Sex, and Street. The key value and replicate number are carried along. The SAS-generated field _NAME_ tells you the name of the original field, and the SAS-generated field COL1 gives you the value. Maple’s second row occupies the next three rows, followed by Larch’s first row. This table extends all the way down to Cedar’s data, and so it has 18 rows in all.

This step separates the table into its smallest elements and tags each with its key value and its replicate number. The next step is to reorganize the table so that the key values form the rows and the replicate numbers label the columns.
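A minimal sketch of the first pass on the character fields follows. The `VAR _CHARACTER_` statement is my guess at how the key field ends up among the transposed cells, as the notes describe; the paper has the author's actual syntax. Sex values not shown on the slide are assumed.

```sas
data PersonRec;
  input Street $ FNam $ Sex $ RecNo;
  datalines;
Maple Moe M 1
Maple Mae F 2
Larch Larry M 1
Larch Leila F 2
Larch Louie M 3
Cedar Curly M 1
;

/* PROC TRANSPOSE with BY needs the input in BY order. */
proc sort data=PersonRec;
  by Street RecNo;
run;

/* One output row per cell: _NAME_ holds the original field */
/* name and COL1 holds the value.                           */
proc transpose data=PersonRec out=PersonLong;
  by Street RecNo;
  var _character_;   /* assumed; numeric fields go in a separate pass */
run;

proc print data=PersonLong;
run;
```

Because every transposed variable here is character, COL1 is character as well. A parallel pass with `VAR _NUMERIC_` on the numeric fields would keep COL1 numeric, which is the reason the two types are processed separately.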

Second Transposition (Person Table)

Street  FNam1  Sex1  Street1  FNam2  Sex2  Street2  FNam3  Sex3  Street3
Maple   Moe    M     Maple    Mae    F     Maple    .      .     .
Larch   Larry        Larch    Leila        Larch    Louie        Larch
Cedar   Curly        Cedar    .      .     .        .      .     .

PROC TRANSPOSE... BY &Keys.; ID _name_ RecNo; VAR col1;

The second transposition takes the output of the first and reshapes it using only the keys as the BY variables. As the code fragment shows, it uses the _name_ field and the replicate number to create field names that combine the original names with an added number. There is one record per key value. Because Maple and Cedar don’t have the maximum number of replicates, some fields in their rows are filled with missing values.

You’ll note that PROC TRANSPOSE gives us replicates of the key field, Street, along with the data. We don’t want that, because it’s redundant. The program runs some cleanup code after this point in order to delete the extraneous variables.
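Both passes can be chained in a short sketch. The second PROC TRANSPOSE uses the BY, ID, and VAR statements shown on the slide; the dataset names, the `VAR _CHARACTER_` in the first pass, and the Sex values not shown on the slides are assumptions.

```sas
data PersonRec;
  input Street $ FNam $ Sex $ RecNo;
  datalines;
Maple Moe M 1
Maple Mae F 2
Larch Larry M 1
Larch Leila F 2
Larch Louie M 3
Cedar Curly M 1
;

proc sort data=PersonRec;
  by Street RecNo;
run;

/* First pass: one row per cell. */
proc transpose data=PersonRec out=PersonLong;
  by Street RecNo;
  var _character_;
run;

/* Second pass: one row per key value. Two ID variables are */
/* concatenated, so _NAME_='FNam' with RecNo=2 names FNam2. */
proc transpose data=PersonLong out=PersonFlat(drop=_name_);
  by Street;
  id _name_ RecNo;
  var col1;
run;

proc print data=PersonFlat;
run;
```

The DROP= option removes the _NAME_ column that the second pass generates, which would only ever contain COL1.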

Final Steps

Reassemble character and numeric parts
Clean up extraneous fields
Merge all the tables by the key fields
Produce report (Excel)

At this point, all the interesting work is done. We have a set of tables, all of which are one-to-one with the keys. We need to reconnect the numeric and character parts of the flattened tables that we separated to do the transpositions. We also need to delete those redundant key variables. The program does some interesting things with macro variables while doing that, but I don’t have time to discuss them. After the cleanup, it’s just a matter of assembling the full table using MERGE with BY on all the components.
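The cleanup and merge can be sketched as below, starting from a Person table that has already been flattened. The numbered-range DROP= is one simple way to remove the redundant key copies, not necessarily the program's; MaxCount is set by hand here, and values not shown on the slides are assumed.

```sas
data Housing;
  input Street $ Tenure $ Type $;
  datalines;
Maple Own House
Larch Rent Apt
Cedar Own Mobile
;

/* Flattened Person table as the second transposition leaves it, */
/* including the redundant Street1-Street3 copies of the key.    */
data PersonFlat;
  infile datalines missover;   /* short lines leave later fields missing */
  input Street $ Street1 $ FNam1 $ Sex1 $
                 Street2 $ FNam2 $ Sex2 $
                 Street3 $ FNam3 $ Sex3 $;
  datalines;
Cedar Cedar Curly M
Larch Larch Larry M Larch Leila F Larch Louie M
Maple Maple Moe M Maple Mae F
;

%let Keys     = Street;
%let MaxCount = 3;   /* maximum replicates, found earlier by PROC SQL */

/* MERGE with BY needs every component in key order. */
proc sort data=Housing;
  by &Keys.;
run;

/* Drop the redundant key copies, then merge on the keys. */
data Flatfile1;
  merge Housing
        PersonFlat(drop=Street1-Street&MaxCount.);
  by &Keys.;
run;

proc print data=Flatfile1;
run;
```

The same MERGE with BY pattern would serve for reconnecting the character and numeric halves of a flattened table, since both halves are one-to-one with the keys.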

Notes and Cautions

Performance issues
Hierarchical databases
SAS features
Field labels preserved
Field formats, lengths not (always) preserved
Duplicate field names, names ending in numerals

The paper discusses some matters you should be aware of. The first pass of PROC TRANSPOSE can produce some very big files, which take time to process and require storage space. If you’re dealing with “big data,” you probably don’t want to flatten the database anyway. There are certain kinds of databases that have a more complicated structure than simply relational. I discuss some workarounds if you want to flatten those. There are some SAS-specific issues. Field labels are preserved, but value labels and field lengths may not be. The merging and flattening process can overwrite fields because of name conflicts, and that is especially a danger if you have fields that end in numerals. If those sound like concerns for you, be sure to read that part of the paper.

Summary

Automated solution to flattening a database
Obtains parameters from the data tables
Runs in BASE SAS
Illustrates handy tricks in macro coding, PROC DATASETS, SQL, TRANSPOSE, DATA step

I have presented a program that is an automated solution to flattening any relational database. It is designed to get the parameters it needs from the data tables themselves. The user does not have to supply table names, number of replicates, or field names. It runs entirely in BASE SAS, and so any SAS user can run this program. If you read the source code, you will find a number of handy tricks involving macro processing, macro variables, and the procedures used in the program. I certainly did not invent all of these myself, but the program illustrates how they can be assembled into a useful whole.

Dav Vandenbroucke
Senior Economist
U.S. Dept. Housing & Urban Dev.
451 7th St SW
Washington, DC 20410

Thank you for sticking with me this late in the day. I’ll be happy to answer any questions.
