Presentation transcript:

Dexter: The Missouri Census Data Center’s Data Extraction Utility. John Blodgett, OSEDA, University of Missouri. Rev. 14 May 2007, jgb

What Is Dexter? A web utility for performing simple data queries, or extracts. An integral part of the MCDC’s Uexplore-based data exploration/access system. Written in SAS© to access data stored in SAS datasets, but it requires no knowledge of, nor access to, SAS.

Who Uses Dexter? Anyone interested in accessing the MCDC data archive, especially anyone who wants to directly access and manipulate the data. Not (directly) intended for the very casual data user. Has a small but non-trivial learning curve. Understanding the mechanics of Dexter is easy compared to understanding the data to be extracted.

Dexter’s Role Within Uexplore Dexter accepts parameters that identify a database file/table from which data are to be extracted. Uexplore provides the navigation tools to help locate and understand the content of datasets. Uexplore hyperlinks actually invoke uex2dex, the Dexter preprocessor, which in turn invokes Dexter.

Uexplore Page With Hyperlinks

The URL Used to Invoke Dexter On the previous screen the dataset name (usccflows.sas7bdat) is a hyperlink. The URL associated with it is: http://mcdc2.missouri.edu/cgi-bin/broker?_PROGRAM=websas.uex2dex.sas&_SERVICE=appdev9&path=/pub/data/mig2000&dset=usccflows&view=0 It calls a program named uex2dex, written in SAS, and passes parms to ID the data table to be queried.

Dexter and Census Data Dexter doesn’t really know much about the datasets from which it extracts data. It is not American FactFinder; it is just a generic extraction tool. It uses only very basic metadata tools. Other tools must be used to assist users in navigating the database.

Dexter and the MCDC Data Archive Technically, there is nothing inherent in Dexter that ties it to this archive. In practice, however, the collection of public data files that we call the “MCDC Data Archive” is what Dexter was created for. It is very probable that the only reason you’re reading this is that you want to access something in that archive.

How Do You Invoke Dexter? Most people will start at the uexplore home page. You navigate the data collection by choosing “filetype” directories and at some point (…yada yada yada) you wind up selecting (clicking on) a file that is a data table. Clicking on the data table invokes the uex2dex preprocessor. You fill out the form which uex2dex generates and click on an “Extract Data” button to actually invoke Dexter.

Accessing Uexplore (Home Page) From the MCDC home page (or any page with the navy blue navigation bar) click on “MCDC Data Archive”. Or enter the URL:

Choose Major Category (from the links in the teal box)

Scroll Within the Filetype Descriptions to Find the Type (mig2000)

Click on the Filetype Name (links to uexplore for that directory/filetype) In this case we want to click on the mig2000 filetype. The text tells us what kind of data we can expect to find in this directory.

Uexplore Page - mig2000 Filetype This page is all about hyperlinks (all the blue text). Before proceeding to the Dexter-invocation links we want to back up and look at the data archive structure.

(Back to) The Uexplore/Dexter Home Page

The Archive Directory (on the Uexplore/Dexter home page) The teal box contains links to 8 major data categories (2000 Census thru Compendia). The rest of the page consists mostly of descriptions of and hyperlinks to the archive’s data categories (which we refer to as filetypes). Filetypes within the major categories are sorted in descending order of what we think will be their popularity. Sf32000x is our most popular filetype.

What’s In the Archive? Very important question. But not the focus of this tutorial. Some day we’ll do a separate, long tutorial just on that topic. Not all filetypes are created equal. We spend 90% of our resources on maybe 10% of our data directories. Filetypes that are in bold are the MCDC “house specialties”.

The Data Archive – General Info We keep the data table files (the things Dexter accesses) in the same directories along with other related files (metadata, spreadsheets, csv files, Readme.html files, etc.) Each filetype directory has a special Tools subdirectory where we keep program code and other tool modules related to the data. Subdirectories & files starting with uppercase letters are listed first and are usually worth looking at. Dexter-accessible table files (“SAS datasets”) have extensions of sas7bdat or sas7bvew.

Exercise The Bureau of Economic Analysis disseminates its REIS data with key economic indicators for US geography down to the county level. Locate the filetype corresponding to this data collection and navigate to the directory page. What’s the major category?

Uexplore Data Directory Page What you see when you click on the beareis link on the Uexplore home page. It displays a list of files within the directory. The “File” column entries are hyperlinks. With a few exceptions the files are displayed in alphabetical order. Datasets.html is a special file providing enhanced navigation of the data files in this directory. It displays just the data-table files, but in a more logical order and with additional metadata.

The Datasets.html Page

Datasets.html Columns The Name column is also a link to uex2dex / dexter. Label is a short description of the dataset. #Rows (# of observations) and #Cols (# of columns/variables) are taken from the datasets metadata set, as are the Geographic Universe and Units. Link to Details is the most important column.

Universe and Units The majority of datasets in the archive contain summary data for geographic areas. For example, a dataset in the popests directory might contain the latest estimates for all counties in the state of Missouri. The geographic universe is Missouri, and the units are counties. When we have many datasets in a directory it’s usually because we have many different combinations of universe and units.

Common Universes Missouri (the state of) is by far the most common universe for the MCDC archive. United States is second – we have quite a number of national datasets. Illinois and Kansas are also very common since we routinely download and convert census files for these key neighbor states. A common sort order for files on Datasets.html pages is Missouri files first, then US, then IL/KS and then other states.

Rows & Columns The rows of the data tables are typically geographic entities: a state, a county, a city, etc. Most of the columns in the data tables are summary stats for the entity: e.g. the 2000 pop count, the latest estimated pop, the change and percent change, etc. Other columns (“variables”) are identifiers with names such as sumlev, geocode and areaname.

Numeric vs. Character Variables SAS© stores data as character strings or as numerics. We store all identifiers (geographic codes, etc) as character strings even if they are made up of numeric digits. So the value of the state code for CT is “09”, not 9. The leading “0” matters. Unfortunately, Excel ignores the distinction when importing csv files.
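For readers who use SAS directly, here is a minimal sketch (with invented values, not taken from the archive) of why the character storage matters:

    /* Minimal sketch: the state code is stored as a 2-character string,    */
    /* so the leading zero survives; the same digits stored as a numeric    */
    /* variable would become just 9.                                        */
    data codes;
      length state $ 2;
      state = '09';        /* Connecticut as a character FIPS code */
      state_num = 9;       /* numeric version: the leading 0 is lost */
    run;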

Dataset Naming Conventions All filetype names are 8 characters or less. Dataset names were limited to 8 characters by the software until recently. The first characters of the dataset name often correspond to the universe – e.g. “mo”, “il”, “us”. The geo units are often part of the ds-name – e.g. “motracts”, “uszips”. For time series data the name usually ends with a time indicator – e.g. “uscom03” contains data thru 2003.

Variable Naming Conventions Not as rigorously applied as we might like, esp. for older datasets (conventions used for 1980 datasets differ a little from 2K and 1990 sets, for example) Certain names appear on many datasets and are consistent. These are mostly identifier variables, the ones used in creating filters and for merging data from different files.

Consistency With Census Bureau Data Dictionary Names The Bureau often distributes data dictionary files with their data that include suggested names for the fields. Their name for the field (variable) containing the name of the geographic area being summarized is ANPSADPI. We decided to go with AreaName instead. But in most cases we try to use the same name as in the data dictionary.

Common ID Variables SumLev: Geographic summary level codes as used in 2K census. (3-char) State: 2-char state FIPS code. County: 5-char county FIPS code, incl. the state. Geocode: A composite code to id a geographic area. E.g. the value for a census tract might be “ ”. AreaName: Name of the area.

Common ID Variables (cont) Tract: census tract in tttt.ss format, always 7 characters with leading 0s and 00 suffixes. E.g. “ ”. Esriid: Similar to geocode but intended to use as a key for linking to shape files from ESRI (the ArcInfo people). When geocode=“ ” the value of esriid=“ ”.

SAS Formats Some variables have custom formats associated with them, which cause them to display a name instead of their actual value. E.g. the variable County may have a value of “29019” but displays as “Boone MO” using the format. Most Dexter output has the formatted values. Click the “View qmeta Metadata report” option at the end of Section II on the Dexter form to see which variables have formats associated.
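If you are curious how such a format works in SAS itself, here is a minimal sketch; the format name $cntyname is made up for illustration, and the real formats shipped with the archive may be defined differently:

    proc format;
      value $cntyname           /* hypothetical character format */
        '29019' = 'Boone MO'
        other   = '(unknown)';
    run;

    data _null_;
      county = '29019';
      put county $cntyname.;    /* prints: Boone MO */
    run;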

More About the MCDC Data Archive mcdc_data_archive.ppt

Details Page We get here by clicking on the Details link on the Datasets.html page. Lots of content here – but it will vary. Key variables is often extremely useful when doing filters. Note the direct link to Dexter under Access the dataset near the bottom.

Increase Text Size to Read Fine Print

Exercise – Navigate to Dataset Earlier we were looking at datasets in the 2000 Census category, filetype mig2000. Go to the Uexplore home page and navigate to this filetype. Use the Datasets.html page to display the datasets within the directory. Find the row for the usccflows data table and click on the Details link for this table. From the Details page click on the keyvals link for the variable State.

Key Variables Report: State Tells you, for each value of the variable State (e.g. 01 for “Alabama”), how many rows of this dataset have that value. This can be very helpful when doing a data filter in Dexter.

Finally… Time to See Dexter

Dexter Input Page (Top) Sec. I. Output Format(s): csv file (into Excel) is most common. Sec. II is where the work is; only 2 of 5 rows are shown here. The user fills out the entire form before using the Extract Data button to invoke Dexter.

Dexter Section II

Filters “A filter is a logical condition that references values of columns within a row. For each row, the condition is evaluated and, if true, the row is selected for output. (If not, the row is omitted, or "filtered".) To keep all rows, just skip this section. The filter being created here can consist of up to 5 logical segments, each referencing a data set Variable, a relational Operator, and a data Value (or values) -- constants that the user must type in. The segments are evaluated as true or false. Logical operators (which default to And and appear between the segment specification rows) relate the segments when more than one is specified, creating a compound logical condition.” If this explanation makes sense to you then you are going to have an easy time with Dexter. If not, follow through the examples and then try reading it again.

Example of Defining a Filter: What We Want Assuming we are running Dexter to access the mig2000.usccflows dataset, we want to select only those rows that: –have Missouri as the anchor state, and –have at least 100 gross flows. We’ll just assume you’ve read the descriptions and have some clue regarding what an anchor state and a gross flow are. (People interested in population migration would be likely to know this.)

Select Variable for Filter Click on the Variable/Column drop-down menu in the 1st row and select State.

Select Comparison Operator Select “Equal to” as the Operator from the drop-down menu in the middle column.

Enter Value to Complete Row Remember the Key Values report showing all the values for the variable State? If you did not know the code for Missouri you could find it there.

What We Have So Far We have created a logical condition that can be evaluated for each row of the dataset: State = ’29’ According to the key values report for State we know that this condition is true for 38,316 rows in the dataset. The filter we are building will select just those 38,316 rows out of the 1.1+ million in the full dataset.
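In SAS terms, this one-segment filter amounts to a WHERE clause. A rough sketch is shown below, assuming a libname such as mig2000 pointing at the archive directory; this is not the exact program Dexter writes:

    data moflows;
      set mig2000.usccflows;
      where state = '29';    /* keep only rows with Missouri as the anchor state */
    run;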

Adding a Second Condition But we do not want all the cases pertaining to Missouri as the anchor state. We only want those where we have at least 100 gross flows (whatever those are). So we need to fill out a second row, adding this condition. We select GrossMig as the variable, Greater Than or Equal To as the Operator and enter 100 in the Value field. We leave the logical operator radio button set to “And” to indicate that this is an additional necessary condition.

The Completed Filter You are now ready to scroll down to Section III.
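The completed two-segment filter, with the default And between segments, corresponds to a compound WHERE condition along these lines (again a sketch, not Dexter's generated code):

    where state = '29' and grossmig >= 100;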

Section III: Choose Variables Conceptually simple section: just select the variables you want on your output from scrollable (if needed) menu lists. Identifiers (character-type variables) are listed separately from numerics, an important MCDC Data Archive convention. Typing names instead of selecting is possible but not recommended. Here we select all variables except State. Note the Extract Data button.
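Continuing the SAS sketch, choosing variables in Section III is roughly equivalent to a KEEP statement. Only a few variable names are shown here for illustration (the dataset has more, and in the walkthrough we kept everything except State):

    data extract;
      set mig2000.usccflows;
      where state = '29' and grossmig >= 100;
      keep areaname county grossmig;    /* illustrative subset of the columns */
    run;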

Section IV: Title & Sort Order Entirely optional, typically not used section. Title value is used as report title if you asked for one, which we did not in this example. Sort specs are handy. Note use of minus sign (hyphen) to indicate a descending sort. Another Extract Data button to use to run query.

Dexter Output Page The first output you see is this results “index” page. There is always a link to a Summary Log page. Additional links depend on the output formats requested.

Dexter Summary Log This file is always generated. Important for documenting the query. Indicates what file(s) were accessed and when the query was run, as well as any filter and the variables kept. Output directory details can usually be ignored.

Select Output File(s) Click on the Delimited File link. What happens when you click on this file depends on how your browser is configured. The file referenced has a .csv extension, which IE usually associates with the Excel plugin. Clicking this link will typically invoke Excel.

Viewing .csv Output in Excel The csv file is read into Excel. Rows 1 & 2 have names & labels. Other rows contain the selected data. Note the sort order.

Some Key Points So Far Navigation tools such as the uexplore home page, the uexplore directory navigator and Datasets.html reference pages are used to make accessing data with Dexter easier. You get to select rows (“filter”) and columns as well as the format(s) of your extracted data. Filtering often requires knowledge of code values. These can sometimes be accessed from the Key Values reports on the Details page referenced by a Datasets.html page. The query generated is summarized on a Summary.log page.

Pop Quiz 1. Can Dexter be used to access an xls file? 2. How are the files sorted on a directory page displayed by uexplore? 3. What does the uex2dex interface app do? 4. What is the fastest way to tell how many rows were selected by your query? 5. Which of the 5 sections of the Dexter query form must be filled out to have a valid request? 6. What’s a filetype? What does it mean when one is displayed in bold on the Uexplore home (Archive Directory) page?

Sample Query 2: What We Want We want data from the 2000 Census, Summary File 3 regarding poverty in Missouri – in cities and counties. We want the number and the % of poor persons, as well as the median household income. We only want the data for cities of at least 5000 persons, but for all counties and for the state as a whole. We want output as an HTML file sorted by the type of geography (state, county, city) and then by descending poverty rate.

What You Need to Know You need to know where these kinds of data are stored. It is 2000 census data, but where among all those different summary files? Read the brief descriptions on the uexplore home page. The sf32000 filetype looks good, but it turns out that it is too big. The standard extract version, sf32000x, has what we need. An alternate way by which users may arrive here is via links on the MCDC Demographic Profile reports.

A Demographic Profile Report A link at the bottom of this report page invokes Dexter with the appropriate dataset selected. Follow the link (in the title of this page) and try it.

(Back on the uexplore home page) Click on sf32000x to Start Descriptions with links from the uexplore home page.

The sf32000x Directory ( As seen by uexplore) Subdirectories & files with upcased first letters are shown first. Index.html, Readme.html and, of course, Datasets.html are required reading (browsing). Files are in alphabetical (not logical) order.

(sf32000x) Readme.html

The Datasets.html Page (for the sf32000x filetype)

Details Page -- sf32000x.moi Lots of info here. Most important is perhaps the Key variables link for the variable SumLev (geographic summary level).

Key Variables Report for SumLev (sf32000x.moi)

Filters Based on SumLev
Var    | Operator | Value       | Results
SumLev | Equals   | 040         | State Level Summary (only 1 row selected)
SumLev | Equals   | 140         | Census Tract Summaries – 1320 rows selected
SumLev | In List  | 040:050:160 | 1 State level, 115 County level & 972 Place level rows

Sample Query 2: What We Want (Repeated in case you forgot) We want data from the 2000 Census, Summary File 3 regarding poverty in Missouri cities and counties. We want the number and the % of poor persons, as well as the median household income. We only want the data for cities of at least 5000 persons, but for all counties and for the state. We want output as an HTML file sorted by the type of geography (state, county, city) and then by descending poverty rate.

A Complex Filter

The Filter Explained There are 2 logical parts to the filter:
1. SumLev In (‘040’,’050’)
2. SumLev = ‘160’ and TotPop >= 5000
The parentheses checkboxes are used to group the 2nd & 3rd lines. The and between lines 2 and 3 is executed before the or between lines 1 and 2.

The Filter Explained, cont. The SAS© code generated by these menu choices:
where sumlev in ('040','050') or (sumlev='160' and totpop >= 5000);
The “in” operator (called “In List” on the Operator pull-down menu) allows specifying that the value of a variable should be one of a list of values. Those values are entered separated by :’s in the Value column of the filter specs form.

Completing the Query: Parts 3 & 4

HTML Output We see that Pemiscot has the highest poverty rate of any county. How do we know this? Why don’t we see any data for cities?

Exercise Access the same dataset as in the example: sf32000x.moi Select census tract summaries in Greene co… … with a poverty rate of at least 10%. Keep all identifiers necessary to identify the tract, and all variables related to poverty. Generate a csv file and load it into a spreadsheet (probably Excel).

Exercise 2 Repeat the previous exercise except do it for all counties (instead of census tracts) in the states of Arkansas and Oklahoma. Sort the results by descending poverty rate and generate output in pdf format as well as a csv file. Hint: A good place to start is the Datasets.html page.

End of Show Questions and Comments to: