ICPSR Tools for the Metadata Portal

Sanda Ionescu, Documentation Specialist, ICPSR

This talk covers the tools we have put together at ICPSR for the Metadata Portal. All of them are DDI-based; some are new, some not so new. They work together well to enhance the user experience with the GSS and ANES collections. The tools currently run against the DDI metadata created in the project on a small site, and will be directly accessible from the Portal. The plan is to review the tools briefly and then spend a little more time on one of them, the Reverse Universe Generator: an innovative tool built at ICPSR for the project, which we hope will be beneficial to a wider range of studies.

Web tools based on DDI documentation
Standardized, consistent, machine-actionable documentation makes it possible to build a variety of tools that significantly enhance the user experience in data discovery, exploration, and analysis, and that facilitate further enhancement of the metadata. The process of enriching the metadata with additional information may be automated when the ‘source’ documentation is already in a standardized format. The tools demonstrate the advantages of working with DDI documentation: on top of the standardized metadata we can build applications that (1) support data discovery and analysis and (2) automate the process of enhancing the metadata by adding new content.

ICPSR tools for the Metadata Portal
The tools are based on complete study- and variable-level DDI metadata. SOLR searches with enhanced features: facets built from the study descriptions for narrowing search results; retrieval of variables measuring separate concepts within a single study (supports discovery for data analysis); fielded searches, including by question text only (allows exploring questions for instrument design and research). For the Portal we have a search tool with special features, some enhanced display features, and a couple of new tools that are actually capable of adding new content to the metadata, and therefore new functionality. I will try to demo these tools directly on the Web.

Enhanced display features: contextual variable display (to place individual variables in a specific context and facilitate navigation among variables in a dataset); “Compare variables” layout (facilitates comparison by displaying descriptions of selected variables on the same screen); question routing diagrams based on DDI output from the Reverse Universe Generator tool.

Cross-study/cross-collection comparison and harmonization: creating taxonomies (public / private); tagging variables with concepts; building variable crosswalks. Tagging enhances the DDI metadata by including additional information. It may be “automated” because the “source” metadata is already in DDI.

Reverse Universe Generator (RUG)
RUG is a tool created entirely at ICPSR. It was developed as a proof of concept and is not currently a production tool; we are still testing to see how we can improve it. In very broad terms, it combines an analysis of the “raw” (ASCII) data with the DDI metadata to reconstruct skip patterns in a dataset. Once these are found, the findings are entered in the DDI instance in the form of variable-level universe statements.

Questionnaire routing patterns are the main reason that led us to develop RUG. As archivists, we are often confronted with routing patterns that are difficult or even impossible to identify in the final study documentation, particularly in legacy studies. Sometimes the information is entirely missing from the resulting data documentation (codebook). Sometimes it is available in a variety of free-text, unstandardized formats that are not machine-processible and are difficult to follow even with the human eye. The problem we identified: routing patterns are important for data analysis, yet they are either unavailable or not easy to use.

Facsimile of survey instrument: non-processible, disconnected from the data description. This is an example of a survey instrument published separately from the data description, which is in a codebook. The question flow is represented in the form of drawings and diagrams, which is not at all machine-actionable. Moreover, the instrument is entirely disconnected from the data description: we can see the questions but do not know to which variables they belong.

Text codebook: non-standardized entries; routing patterns non-intuitive, difficult to reconstruct, or even missing (ANES Time Series; ANES Panel Recontact 2010). These are two examples of codebooks that include the data description. The one on the left is older (both the study and the codebook): some routing information is available, but it is not complete. It points from a variable to a question without saying which variable is meant, and it is unclear whether the reference to "NO. 3" is to a variable or to a question. One needs to browse the codebook back and forth to find the variable that is based on Q.10, to determine whether Ref. no. 3 is indeed a variable, to find it in the dataset, and so on. In the second example (a newer study), routing information is entirely missing: the value -1 is assigned to skips, but we know nothing about their origin.

Premise: in the dataset, system missing values on the “dependent” variable will map exactly to the values that generated the skip pattern on the “independent” variable. If there is a routing dependency, the values match directly. Based on this assumption, RUG regresses each variable containing a “system missing” code on every other variable in the dataset, and determines routing dependencies from any exact matches identified between values, or sets of values.

0 = INAP. (is “sysmiss” throughout the dataset). The table represents, in a simplified and schematic way, what is going on in the dataset. We have a dataset in which the value 0 is consistently assigned to system missing values. On V62, 0 maps to all of the values 5, 8, and 9 on V61, so we can conclude that there is a routing dependency (a skip pattern, or conditional flow). Matches on 0 are ignored by RUG because they can generate false dependencies. For example, V60 matches on 0 only, so it will not be found as the origin of a skip pattern. But the 0s on V60 through V62 match the value 1 on V3, so all three variables will be found to be dependent on V3.
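
The matching described above can be sketched in a few lines of Python. This is only one reading of the premise, not RUG's actual code: the function names `skip_values` and `find_dependencies` and the eight sample cases are hypothetical, invented to mirror the schematic table. A value on the independent variable counts as a skip trigger when every case carrying it is system missing on the dependent variable, and matches on the system missing code itself are dropped.

```python
SYSMISS = 0  # assumption: 0 = INAP. throughout the dataset, as in the slide


def skip_values(rows, dependent, independent, sysmiss=SYSMISS):
    """Values of `independent` for which `dependent` is always system missing."""
    always = {}  # value of independent -> True while every case seen is a skip
    for row in rows:
        value = row[independent]
        is_skip = row[dependent] == sysmiss
        always[value] = always.get(value, True) and is_skip
    # Matches on the sysmiss code itself are ignored (they create false hits).
    return sorted(v for v, ok in always.items() if ok and v != sysmiss)


def find_dependencies(rows, sysmiss=SYSMISS):
    """Map each (dependent, independent) pair to its skip-triggering values."""
    names = list(rows[0])
    found = {}
    for dep in names:
        if not any(row[dep] == sysmiss for row in rows):
            continue  # no system missing code, so nothing to explain
        for indep in names:
            if indep == dep:
                continue
            values = skip_values(rows, dep, indep, sysmiss)
            if values:
                found[(dep, indep)] = values
    return found


# Hypothetical cases, invented for illustration only.
rows = [
    {"V3": 1, "V60": 0, "V61": 0, "V62": 0},
    {"V3": 1, "V60": 0, "V61": 0, "V62": 0},
    {"V3": 2, "V60": 1, "V61": 1, "V62": 3},
    {"V3": 2, "V60": 2, "V61": 5, "V62": 0},
    {"V3": 3, "V60": 1, "V61": 8, "V62": 0},
    {"V3": 4, "V60": 2, "V61": 9, "V62": 0},
    {"V3": 4, "V60": 1, "V61": 1, "V62": 2},
    {"V3": 3, "V60": 2, "V61": 1, "V62": 1},
]
deps = find_dependencies(rows)
for (dep, indep), values in sorted(deps.items()):
    print(f"{dep} is skipped when {indep} is in {values}")
```

On these invented cases V60, V61, and V62 all come out as dependent on the value 1 of V3, V62 additionally depends on the values 5, 8, and 9 of V61, and V60 is never reported as a source because it only matches on 0.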

RUG acts on information provided by the DDI metadata, where it finds what it needs about the variables in a dataset: categories marked as “system missing” (<catgry missing="Y" missType="sysmiss">), the category value, and the variable location in the dataset. It first looks for “system missing” flags on categories, then reads in the value assigned to that category and the variable location. It then finds the variable in the ASCII dataset and looks for perfect mappings between the value defined as “system missing” and other sets of values on different variables, as previously shown.
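
As a rough illustration of that lookup, the sketch below reads the system missing flag, the category value, and the variable location from a small DDI-like fragment using Python's standard `xml.etree.ElementTree`. Several things here are assumptions rather than facts from the talk: the fragment omits the DDI namespace, the `location` element with `StartPos` and `width` attributes follows DDI Codebook conventions, the column positions and the "Yes"/"No" labels are invented, and `read_sysmiss_variables` and `read_column` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical DDI-like fragment (no namespace, invented positions and labels).
DDI_SAMPLE = """
<codeBook>
  <dataDscr>
    <var name="V61">
      <location StartPos="120" width="1"/>
      <catgry missing="Y" missType="sysmiss">
        <catValu>0</catValu>
        <labl>INAP.</labl>
      </catgry>
      <catgry>
        <catValu>1</catValu>
        <labl>Yes</labl>
      </catgry>
    </var>
    <var name="V3">
      <location StartPos="7" width="1"/>
      <catgry><catValu>1</catValu><labl>No</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""


def read_sysmiss_variables(ddi_xml):
    """Return {variable name: sysmiss value and column location}."""
    root = ET.fromstring(ddi_xml.strip())
    result = {}
    for var in root.iter("var"):
        location = var.find("location")
        for cat in var.findall("catgry"):
            if cat.get("missing") == "Y" and cat.get("missType") == "sysmiss":
                result[var.get("name")] = {
                    "sysmiss": cat.findtext("catValu").strip(),
                    "start": int(location.get("StartPos")),
                    "width": int(location.get("width")),
                }
    return result


def read_column(line, start, width):
    """Slice one fixed-width value out of an ASCII record (start is 1-based)."""
    return line[start - 1:start - 1 + width]


info = read_sysmiss_variables(DDI_SAMPLE)
print(info)
```

Only V61 is returned, because V3 has no category flagged as system missing; `read_column` would then be applied to each record of the ASCII file to pull out the values to compare.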

Once a routing dependency is identified, RUG inserts a universe statement in the DDI description of the “dependent” variable, pointing to the source of the dependency (the “independent” variable) and listing the relevant categories involved: <var><universe>V61 was asked if the answer to V3 was 2, 3, or 4</universe></var>
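
A minimal sketch of how such a statement could be generated and attached to a variable, again with `xml.etree.ElementTree`. The helper names `join_values`, `universe_text`, and `add_universe` are hypothetical, and the sketch simply appends the element as the last child of `var`; a real DDI instance has a defined element order and a namespace that this ignores.

```python
import xml.etree.ElementTree as ET


def join_values(values):
    """Render [2, 3, 4] as '2, 3, or 4'."""
    values = [str(v) for v in values]
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        return f"{values[0]} or {values[1]}"
    return ", ".join(values[:-1]) + f", or {values[-1]}"


def universe_text(dependent, independent, asked_values):
    """Build the free-text universe statement in the form shown on the slide."""
    return (f"{dependent} was asked if the answer to {independent} "
            f"was {join_values(asked_values)}")


def add_universe(var_element, text):
    """Append a <universe> child to a <var> element (simplified placement)."""
    universe = ET.SubElement(var_element, "universe")
    universe.text = text
    return universe


var = ET.Element("var", {"name": "V61"})
add_universe(var, universe_text("V61", "V3", [2, 3, 4]))
xml_out = ET.tostring(var, encoding="unicode")
print(xml_out)
```

Note that the statement lists the values for which the question was asked, that is, the complement of the skip-triggering values found in the data.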

Universe statement: free text, machine-readable but not software-friendly. The added universe statement is valuable information, but it is not programming-friendly: software developers do not like to program against free text, even if it is included in a standardized specification. So we added a feature to RUG: it inserts internal links (XSLT-processible) in the universe of the dependent variable and in the related category labels of the independent variable. These links can be processed by stylesheets and enable two-way Web navigation between the variables involved in a skip pattern, visually recreating the original routing scheme.

Web presentation with routing diagram and live links. “Dependent variable”: Example of presentation of skip patterns

Web presentation with routing diagram and live links. “Independent variable”: Links to the dependent variables are added in the “go to” statements at category level (in the relevant categories).

RUG tool characteristics / limitations: RUG only works as expected if system missing values are uniquely and consistently coded and labelled as such. This condition is not satisfied by many studies. It is the most important characteristic to note, and it is not a limitation of the tool but rather of many datasets we have looked at: we found many instances in which the same code is used to mark system missing values as well as other types of missing values, such as DK, Refused, or not available (NA). [[This finding also important because we think RUG can act as an incentive for data producers to try to distribute cleaner datasets if these can be further (???)]] The tool does not propose to differentiate between categorical variables and variables defined as “continuous” in some statistical systems. Statistical systems define as continuous some variables that are in fact categorical, based only on the number of categories or on other criteria (one negative value, or no values less than 10 [SPSS]). RUG only examines individual values (no ranges of values) in the dataset. Weights, which are continuous, can be eliminated from the analysis.

First tests and adjustments to eliminate false hits. The first round of tests led to some changes in the tool: an option to search only preceding variables for sources of dependencies, knowing that in most cases the variable order mainly follows the question order; an option to eliminate weights from the analysis (weight variables need to be flagged in the DDI input file), since weights are in fact continuous variables; and an option to ignore “sysmiss to sysmiss only” matches. Testing continues for potential sources of false hits and other minor bugs. CASEID has a great number of unique values, which will certainly match values on all variables. Derived variables are not part of the original questionnaire, but may have values matching the source variables from which they were built.
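
The first two options amount to narrowing the pool of candidate source variables before any matching is done. The sketch below shows one way this could look; it is an assumption about the behavior, not RUG's implementation, and the function name `candidate_sources`, its parameters, and the weight variable name `WT1` are all hypothetical.

```python
def candidate_sources(variables, dependent, weights=(),
                      preceding_only=True, drop_weights=True):
    """Return the variables to test as possible sources for `dependent`.

    `variables` is the dataset's variable order; `weights` holds the names
    of variables flagged as weights in the DDI input.
    """
    if preceding_only:
        # Variable order mostly follows question order, so look backwards only.
        pool = variables[:variables.index(dependent)]
    else:
        pool = [v for v in variables if v != dependent]
    if drop_weights:
        pool = [v for v in pool if v not in weights]
    return pool


variables = ["CASEID", "V3", "WT1", "V60", "V61", "V62"]
print(candidate_sources(variables, "V61", weights={"WT1"}))
print(candidate_sources(variables, "V61", weights={"WT1"},
                        preceding_only=False, drop_weights=False))
```

CASEID stays in the pool in both calls, which matches the note above that identifier and derived variables are still being tested as sources of false hits.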

RUG and DDI. The importance of DDI documentation for this tool cannot be overstated: RUG works on DDI input. DDI input: because the metadata is standardized, the “system missing” and “weight” flags required by RUG may be added in an automated way (if not already present). DDI output: RUG adds value to the original metadata by generating variable universe statements that can be used to recreate the questionnaire flow and are relevant to data analysis. In this way RUG contributes to enriching the original DDI metadata by adding new content.

Questions? Thank You!
