
1 DDI Across the Life Cycle: One Data Model, Many Products IASSIST Meeting, Tampere, Finland, May 29, 2009 Inter-university Consortium for Political and Social Research (ICPSR) and Survey Research Operations (SRO)

2 Presenters Mary Vardigan, Assistant Director, ICPSR Sue Ellen Hansen, Director, SRO Technical Systems Group Peter Granda, Archivist, ICPSR Sanda Ionescu, Documentation Specialist, ICPSR Felicia LeClere, Associate Research Scientist, ICPSR

3 The Collaborators Both are units of the Institute for Social Research, University of Michigan – ICPSR is a large social science data archive – SRO is a data collection center

4 Past Collaborations Working together on the National Survey of Family Growth, sponsored by NCHS, to create data and an interactive codebook Partnered on the Collaborative Psychiatric Epidemiology Surveys, sponsored by NIMH – This involved a harmonization of three datasets and interactive documentation featuring question comparison and five languages – www.icpsr.umich.edu/CPES

5

6 Rationale for Collaboration We share a need for rich, high-quality metadata We want to comply with metadata standards – in particular, the Data Documentation Initiative (DDI) DDI 3 enables life cycle perspective We need to pass data easily from SRO to ICPSR without information loss

7 SRO-ICPSR Joint Project Shared DDI-compliant data model and database design for survey metadata Challenges: – Different computing platforms – Different end products – Different staff orientations

8 [Diagram: shared system architecture] Inputs: Blaise Datamodel (BMI); Blaise Database (BDB); DDI 2 or 3 file; other file types (e.g. SAS, SPSS, etc.). User specifies files (location, file type, etc.) using an application. Import tools: SRO Blaise Parsing Tool; ICPSR Import Tool; other importing tool. Databases: SRO Relational Database (online/networked SQL Server); Client Relational Database (offline SQL Server Express); ICPSR Relational Database (online/networked Oracle). Clients: SRO/ICPSR/other web client; web server; offline/local application (stand-alone client application; client application with sync data). Functions: export data; display metadata; edit/review metadata; export questionnaire; export codebook. ICPSR web client: Variable Search; Internal Variable Browser; NSFG Data Management. Other labels in the diagram: Task A; Task B; Task B and D; Tasks C and D; online or offline.

9 Products and Benefits SRO Tools to enhance MQDS, which produces XML documentation from Blaise instruments Tool to permit external users to add metadata for NSFG ICPSR Variable-level database that permits users to search across the ICPSR collection; compare variables; create new datasets and questionnaires Internal variable search for harmonization

10 Data Life Cycle Coverage

11 Michigan Questionnaire Documentation System (MQDS) Sue Ellen Hansen Nicole Kirgis

12 What Does MQDS Do? Facilitates automated documentation and harmonization of Blaise survey instruments and datasets – Extracts survey question metadata – Standardized format

13 Survey Question Metadata Question universe Variable name and label Question text Question variable text (fills) Data type Code values and code text Skip instructions etc.
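The metadata items listed on this slide can be pictured as one record per survey question. The following is a minimal sketch only: the class name, field names, and the example values are assumptions made for illustration and are not taken from MQDS, Blaise, or the DDI schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QuestionMetadata:
    """One survey question, holding the metadata items named on the slide."""
    variable_name: str
    variable_label: str
    question_text: str
    data_type: str
    universe: Optional[str] = None                        # who is asked the question
    fills: List[str] = field(default_factory=list)        # question variable text
    codes: Dict[int, str] = field(default_factory=dict)   # code value -> code text
    skip_instruction: Optional[str] = None


# Invented example record, for illustration only.
example = QuestionMetadata(
    variable_name="MARSTAT",
    variable_label="Marital status",
    question_text="What is ^RNAME's current marital status?",
    data_type="enumerated",
    universe="All respondents aged 18 or older",
    fills=["^RNAME"],
    codes={1: "Married", 2: "Not married"},
    skip_instruction="If 2, go to next section",
)
```

Keeping fills and codes as separate collections mirrors the slide's distinction between question variable text and code values with code text.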

14 Data Documentation Initiative (DDI) Standard specification for technical documentation of social science data eXtensible Markup Language (XML) – Widely used – Facilitates sharing of data Initial focus on standard dataset codebook Ongoing development http://www.ddialliance.org/

15 MQDS Version 1 Extracted metadata from Blaise data model as XML tagged data Provided user interface for selection of – Blaise files – Instrument questions and sections – Types of metadata to extract – Languages to display – Style sheet for generation of instrument documentation or codebook

16 Using MQDS V1 XML: Codebook in Five Languages National Latino and Asian American Study www.icpsr.umich.edu/CPES

17 MQDS Version 1 Limitations – XML not DDI-compliant DDI Version 2 did not have XML tags for all metadata provided by Blaise Did not provide easy means of adding XML tags without becoming noncompliant – XML files for complex surveys can be very large (text files) Entire files had to be processed in computer memory Limited ability to fully automate documentation

18 DDI Version 3 Released April 2008 Focus on the complete data life cycle – going beyond the codebook

19 DDI Version 3 Included extensions proposed by DDI working group on instrument design Persistent Content of Question / Use of Question in Instrument: Question text (static; dynamic or variable); order and routing (sequence / skip patterns; loops); multiple-part question; universe; response domain (open; set categories; special types such as date, time, etc.); analysis unit; definitional text; instructions

20 MQDS Version 3 Joint SRC and ICPSR venture Goals: – Address version 2 limitations Process Blaise instrument of any size – Exploit new elements and validate to the recently released DDI version 3 standard – Move from processing XML metadata in memory to streaming metadata to a relational database
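The last goal on this slide, streaming metadata into a relational database instead of holding a whole XML file in memory, can be sketched as follows. This is an illustration under stated assumptions, not the MQDS implementation: the XML element names are hypothetical and are not DDI, the function name is invented, and an in-memory SQLite database stands in for SQL Server.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, non-DDI element names, used only to keep the sketch short.
SAMPLE_XML = b"""<instrument>
  <question name="Q1"><text>How old are you?</text></question>
  <question name="Q2"><text>Are you employed?</text></question>
</instrument>"""


def stream_questions_to_db(xml_source, conn):
    """Insert each <question> as soon as it is parsed, then release its subtree."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS question (name TEXT PRIMARY KEY, text TEXT)"
    )
    count = 0
    # iterparse yields elements incrementally instead of building the full tree first.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "question":
            conn.execute(
                "INSERT INTO question (name, text) VALUES (?, ?)",
                (elem.get("name"), elem.findtext("text")),
            )
            count += 1
            elem.clear()  # discard the children and text of the processed question
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")  # stand-in for SQL Server in this sketch
loaded = stream_questions_to_db(io.BytesIO(SAMPLE_XML), conn)
rows = conn.execute("SELECT name, text FROM question ORDER BY name").fetchall()
```

Because each question is written and cleared before the next one is parsed, the amount of parsed content held at once does not grow with the size of the instrument in the way a fully loaded tree would.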

21 MQDS Version 3 Relational Database: Import, Export, Transform 1. Import: User specifies input files (location, file type, etc.): Blaise Datamodel (BMI), Blaise Database (BDB), other file types (e.g. SAS, SPSS, etc.) 2. Export: XML (DDI 3); user specifies output files (location, language/locale, XML output options, etc.) 3. Transform: Codebook, Questionnaire; user specifies stylesheet selection criteria, type of output desired (html, rtf, pdf), etc. Relational Db: SQL Server / SQL Server Express; database connection settings; DDI 3 elements not in *.bmi

22 MQDS Version 3 Relational database – DDI compliant standardized tables – Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs – Allows Automated documentation of any Blaise survey instrument Importing and documenting data produced by other software Lower cost development of other tools that facilitate editing and disseminating data

23 MQDS V3 Prototype: Exporting Language XML

24 MQDS Development Expect to release Summer 2009 Working out a distribution plan for Blaise users

25 Data Life Cycle Coverage

26 Applications: Customized Editing Tool Peter Granda ICPSR

27 MQDS Version 3 Relational database – DDI compliant standardized tables – Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs – Allows: Development of new tools to deal with the practical problems involved in transforming data and documentation derived from Blaise instruments into public-use products

28 Features of the Tool Loads MQDS output into database tables Web interface to permit quick viewing Application that permits both internal and external clients to access and edit variable-level information Ability to include disposition codes to designate which variables to include in public-use files Maintain permanent record of decisions made throughout the editing process

29

30

31

32

33 SELECT VARIABLE TO EDIT FROM DATABASE POPULATED WITH METADATA FROM MQDS WITH POSSIBLE REVISIONS FROM SUBSEQUENT DATA PROCESSING STEPS Variable Name Variable Label Universe Statements Value Labels List of Standard Formats Question Text VARIABLE DISPOSITION: Place in public-use file Place in restricted-use file Leave in original file created by the data producer
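The three variable dispositions on this slide, together with the earlier requirement to keep a permanent record of editing decisions, suggest an append-only decision log. The sketch below is one possible shape for that idea; the class names, the editor identifiers, and the example variable are all hypothetical and do not describe the actual editing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List


class Disposition(Enum):
    PUBLIC_USE = "Place in public-use file"
    RESTRICTED_USE = "Place in restricted-use file"
    ORIGINAL_ONLY = "Leave in original file created by the data producer"


@dataclass(frozen=True)
class Decision:
    variable_name: str
    disposition: Disposition
    editor: str
    decided_at: datetime


class DecisionLog:
    """Append-only history; the latest entry per variable is the current one."""

    def __init__(self) -> None:
        self._entries: List[Decision] = []

    def record(self, variable_name: str, disposition: Disposition, editor: str) -> None:
        # Earlier decisions are never overwritten, only superseded.
        self._entries.append(
            Decision(variable_name, disposition, editor, datetime.now(timezone.utc))
        )

    def current(self, variable_name: str) -> Disposition:
        for entry in reversed(self._entries):
            if entry.variable_name == variable_name:
                return entry.disposition
        raise KeyError(variable_name)

    def history(self, variable_name: str) -> List[Decision]:
        return [e for e in self._entries if e.variable_name == variable_name]


log = DecisionLog()
log.record("ZIPCODE", Disposition.PUBLIC_USE, editor="editor_a")
log.record("ZIPCODE", Disposition.RESTRICTED_USE, editor="editor_b")
```

Here `log.current("ZIPCODE")` reflects the second decision, while `log.history("ZIPCODE")` still returns both entries in the order they were made.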

34 Data Life Cycle Coverage

35 Social Science Variables Database: The Public Search Sanda Ionescu ICPSR

36 SSVD – The Public Search ICPSR variables search – Internal (staff, other authorized users) – External (public)

37 SSVD – The Public Search Enables ICPSR users to search variables across datasets Assists in data discovery, comparison, harvesting, and analysis Useful in question mining for designing new research

38 SSVD – The Public Search Concept first tested in a pilot project completed in 2005 – Good functionality – Demonstrated benefits of using DDI markup: easy import; complex, granular searches; user-friendly display – Limited number of datasets (69 ICPSR studies included)

39 SSVD – The Public Search Expand the project to ultimately include most of ICPSR’s holdings – Generate DDI documentation for most ICPSR studies Need for automated production – Build a solid, state-of-the-art, DDI compliant database Handle large number of files Support multiple applications

40 SSVD – The Public Search The Hermes batch processing system *: SPSS system / portable file (Mandatory) Question text file in fixed format (Optional) ASCII data file Statistical setups: SPSS, SAS, Stata Ready-to-go data files: SAS transport, SPSS portable, Stata system DDI 2.1 variable-level documentation with frequencies [and question text (optional)] PDF Codebook (Part of ) *This is a simplified diagram

41 SSVD – The Public Search Hermes: – Consistent, reliable source of variable descriptions in DDI – DDI documentation limited to content of input files Labels may be truncated or may contain abbreviations Question text may be missing although available in original documentation

42 SSVD – The Public Search Additional quality standards necessary for DDI documentation, to maximize effectiveness of Public Search: – Presence of question text, whenever available – Increased readability of variable/value labels, especially if question text is not present

43 SSVD – The Public Search Not all ICPSR studies qualify for variable-level searches Criteria for selecting studies; not included: – Aggregate/statistical data (e.g., Census data, Data Books, Roll Call records, etc.) – Poor documentation – Some restricted data

44 SSVD – The Public Search Pre-SSVD upload: – Review of DDI output from Hermes to apply content quality standards and study selection criteria – Additional work to upgrade DDI where necessary (and feasible) Add question text Complete truncated text Improve readability of labels Add frequencies

45 SSVD – The Public Search Preparing studies for SSVD: – Started end of 2006 – Included DDI produced for previous projects – Reviewed all variable-level DDI created at ICPSR, November 2006 to present (new releases and updates)

46 SSVD – The Public Search New database finalized Fall 2008 Built to match DDI 3.0 data model Both DDI 2.x and DDI 3.0 compliant – Designed to accept both DDI 2.x and 3.0 input and produce output in both versions ICPSR version currently uploads DDI 2.1 and generates DDI 3.0 descriptions of individual variables.

47 SSVD – The Public Search First batch of variable-level description files uploaded into SSVD: – Approx. 3,500 DDI files (one file per dataset), representing Approx. 1,300 ICPSR studies (approx. 18.5 percent of total ICPSR holdings, excluding US Census; approx. 30 percent of holdings with data and setups) – Over 1,000,000 individual variable descriptions; 23,000,000 categories

48 SSVD – The Public Search Currently in Beta-testing phase. – Email bugs to ssvd-testing@icpsr.umich.edu Uses Oracle Text. http://www.icpsr.umich.edu/ICPSR/ssvd/index.html

49 SSVD – The Public Search Moving forward… Fall 2009: switch to Solr searches (based on Lucene) – Faster – More sophisticated: results filtered by multiple relevant parameters Enable side-by-side/same page display of selected variables for comparison Enable variable search from individual study page (search within study)

50 SSVD – The Public Search Moving forward… Adding content: – Second batch of DDI files ready to upload: 900 DDI files, representing 500-600 studies (will bring total close to 45 percent of ICPSR studies with data and setups) – Initiate retrofit project to examine older studies that were not covered in the first conversion phase

51 SSVD – The Public Search Moving forward… Transition to automated DDI upload – DDI uploaded at the time of study publication – First quality check performed by study processing staff – Acceptable DDI immediately released for public view – Problematic DDI suppressed from public view for further review, and upgrade as appropriate
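The release-or-suppress step described here could be driven by a simple automated check before staff review. The sketch below is hypothetical: the rule it applies (flag a variable that has no question text and only a very short label) and the length threshold are assumptions loosely based on the quality standards listed earlier in the talk, not ICPSR's actual criteria, and the function and field names are invented.

```python
from typing import Dict, List, Tuple


def review_variables(
    variables: List[Dict[str, str]], min_label_length: int = 10
) -> Tuple[str, List[str]]:
    """Return ("release" or "suppress") plus a list of the problems found."""
    problems = []
    for var in variables:
        has_question = bool(var.get("question_text", "").strip())
        label = var.get("label", "").strip()
        # Assumed rule: a short label is a problem only when no question text backs it up.
        if not has_question and len(label) < min_label_length:
            problems.append(
                f"{var['name']}: no question text and label {label!r} is too short"
            )
    return ("suppress" if problems else "release"), problems


# Invented variables, for illustration only.
batch = [
    {"name": "V1", "label": "AGE", "question_text": "How old are you?"},
    {"name": "V2", "label": "Q12A", "question_text": ""},
]
status, found = review_variables(batch)
```

In this example the batch is suppressed because of `V2`; `V1` passes despite its short label because its question text is present.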

52 Data Life Cycle Coverage

53 Applications: Internal Variable Search and Documentation Felicia LeClere, ICPSR

54 The Integrated Fertility Survey Series 5-year grant from NICHD to harmonize data from 10 large surveys of marriage, fertility, and child-bearing in the United States 10 surveys conducted from 1955 through 2002

55 Problem of Harmonization To make decisions about harmonizing across all files, we need: Question text Value labels and categories The ability to find and export metadata from all 10 files at the variable level The ability to document each variable, recode, and variable choice

56 Tools from Variables Database Need to be able to do nested searches that are documented Need to be able to search all fields individually and in sequence Need to be able to download results and document what search terms were used
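A documented nested search, in which each search runs on the results of the previous one and the terms used are recorded, can be sketched as below. This is an illustration only: the function names, field names, and sample variables are invented, and simple substring matching stands in for whatever search engine the internal tool uses.

```python
from typing import Dict, List, Tuple

Variable = Dict[str, str]


def search(variables: List[Variable], field_name: str, term: str) -> List[Variable]:
    """Case-insensitive substring match on a single field."""
    needle = term.lower()
    return [v for v in variables if needle in v.get(field_name, "").lower()]


def nested_search(variables: List[Variable], steps: List[Tuple[str, str]]):
    """Apply each (field, term) step to the previous step's results, logging each step."""
    results = variables
    log = []
    for field_name, term in steps:
        results = search(results, field_name, term)
        log.append({"field": field_name, "term": term, "hits": len(results)})
    return results, log


# Invented variables, for illustration only.
VARIABLES = [
    {"name": "V101", "label": "Age at first marriage",
     "question_text": "How old were you when you first married?"},
    {"name": "V102", "label": "Age at first birth",
     "question_text": "How old were you when your first child was born?"},
    {"name": "V103", "label": "Number of marriages",
     "question_text": "How many times have you been married?"},
]
hits, search_log = nested_search(
    VARIABLES, [("question_text", "married"), ("label", "age at")]
)
```

The returned log can be saved alongside the downloaded results so that the search terms behind each harmonization decision remain on record.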

57 ICPSR SSVD Internal Search All 10 datasets were loaded in ICPSR’s version of the shared database Designed to capture all of the relevant fields that were marked up in DDI

58 Entry screen for internal search

59 Search results screen

60 Excel download from search Can also download value labels and codes

61 Search Utilities Downloaded search fields serve to: – 1. Identify variables to be harmonized – 2. Provide metadata for “translation tables” which are used to harmonize files

62 Harmonization steps Use search results to populate two intermediate steps in re-forming the data set Exploratory comparative tables » Use these comparative tables to make decisions about harmonization by examining universes, question texts, and response categories Translation tables » These tables are designed to provide instructions on recoding the underlying items from the 10 surveys to a single harmonized item. Each table provides instructions to an automated SAS program that recodes items from the 10 surveys.
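The role of a translation table, mapping each survey's own codes onto a single harmonized item, can be illustrated with a small lookup. The project described here uses an automated SAS program; the Python sketch below only shows the idea, and the survey names, variable names, and code values are all invented.

```python
from typing import Dict, Optional, Tuple

# (survey, source variable, source code) -> harmonized code. Invented values.
TRANSLATION_TABLE: Dict[Tuple[str, str, int], int] = {
    ("SURVEY_A", "BPLACE", 1): 1,  # "United States" -> 1 = U.S.-born
    ("SURVEY_A", "BPLACE", 2): 2,  # "Other country" -> 2 = foreign-born
    ("SURVEY_B", "BORNUS", 1): 1,  # "Yes" -> 1 = U.S.-born
    ("SURVEY_B", "BORNUS", 5): 2,  # "No"  -> 2 = foreign-born
}


def harmonize(survey: str, variable: str, code: int) -> Optional[int]:
    """Look up the harmonized code; None means the table has no instruction."""
    return TRANSLATION_TABLE.get((survey, variable, code))


# Recode one invented record from each survey into the harmonized item.
harmonized_a = harmonize("SURVEY_A", "BPLACE", 2)
harmonized_b = harmonize("SURVEY_B", "BORNUS", 5)
unmapped = harmonize("SURVEY_A", "BPLACE", 9)
```

Returning `None` for an unmapped code, rather than guessing, keeps gaps in the translation table visible so they can be resolved in the table itself.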

63 Comparative table – date of birth

64 Translation Table for place of birth

65 Harmonization steps After the translation table is complete, the recode instructions for all 10 files are built into the SAS file and a new data file is created. The underlying metadata provided by the database allow us to (1) search all 10 files, (2) explore comparability, and (3) recode to new variables

