Improving Data Discoverability and Interoperability with DDI Metadata


1 Improving Data Discoverability and Interoperability with DDI Metadata
Barry Radler, Distinguished Researcher (UW-Madison Institute on Aging); Jared Lyle, Director (DDI Alliance) and Archivist (ICPSR); Jon Johnson, Technical Lead (CLOSER, UCL)

2 Overview
Barriers to sharing data and metadata
DDI: the metadata standard for Social Science
DDI use cases in research projects: MIDUS portal, CLOSER portal
DDI use case with a data repository: ICPSR archive
DDI Takeaways

3 Barriers to Sharing Data and Metadata

4 Metadata are like punctuation

5 ...for your data

6 Sharing data and metadata
Data are meaningless without metadata
Data require good documentation for understanding
Metadata can act as a glue retaining information through the data lifecycle
Description of provenance
Context of data collection
Common structure and semantics allow consistent processing
Retains the definition and relationships between the different elements
Different agencies and clients have different systems
Taking over a survey from another agency often requires re-inputting everything
Questionnaire specification quality and format differences
Different clients have different requirements

7 The survey process, especially on large-scale longitudinal studies, involves many actors and organizations. Reducing information loss between them is critical to increasing quality and reducing costs.

8 DDI: the Metadata Standard for Social Science
The Data Documentation Initiative is an international standard for describing social science metadata in distributed network environments.

9 DDI Adopters DDI is being used in over 80 countries around the world. Major projects producing DDI include: CLOSER - UK longitudinal studies Consortium of European Social Science Data Archives German Microcensus Data Archive International Household Survey Network (IHSN) Midlife in the U.S. (MIDUS) longitudinal study Statistics Canada Statistics Denmark U.S. Bureau of Labor Statistics Inter-university Consortium for Political and Social Research (ICPSR)

10 Why use it?
Advantages:
A Free and Open Standard (XML)
Introduces a common communication protocol to research processes
Increases transparency across systems and software
Interoperates with other standards such as DataCite and Dublin Core

11 Benefits of using DDI
Makes research data:
Independently understandable
  To secondary users without data provider responding to individual queries
  Critical information about research data is identified with standard ‘tags’
Machine-actionable
  Reduce manual processes or transcription between steps of systems
  Increase transparency within and between organisations
  Data require metadata for structured reuse throughout the data lifecycle
Discoverable, Dynamic, Interactive!

12 Before DDI ... Example: And now a few questions about you…
At present, how satisfied are you with your LIFE? Would you say A LOT, SOMEWHAT, A LITTLE, or NOT AT ALL
1. A LOT
2. SOMEWHAT
3. A LITTLE
4. NOT AT ALL

13 After DDI ...
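The "after" slide is an image in the original deck, so its content is not reproduced in this transcript. As a rough sketch of the idea only, the life-satisfaction question from the previous slide could be marked up in the style of a DDI-Codebook variable description and then read by a program. The variable name LIFESAT and its label are made up for illustration, the namespace declaration is omitted for brevity, and the fragment should be checked against the DDI schema before being treated as valid DDI.

```python
import xml.etree.ElementTree as ET

# Simplified, DDI-Codebook-style markup for the question on the previous slide.
# "LIFESAT" and its label are hypothetical; namespaces are omitted on purpose.
DDI_FRAGMENT = """
<var name="LIFESAT">
  <labl>Satisfaction with life at present</labl>
  <qstn>
    <qstnLit>At present, how satisfied are you with your LIFE?</qstnLit>
  </qstn>
  <catgry><catValu>1</catValu><labl>A LOT</labl></catgry>
  <catgry><catValu>2</catValu><labl>SOMEWHAT</labl></catgry>
  <catgry><catValu>3</catValu><labl>A LITTLE</labl></catgry>
  <catgry><catValu>4</catValu><labl>NOT AT ALL</labl></catgry>
</var>
"""


def read_variable(xml_text):
    """Parse one variable description into a plain dictionary."""
    var = ET.fromstring(xml_text)
    # Each category pairs a numeric code with its human-readable label.
    codes = {
        int(cat.findtext("catValu")): cat.findtext("labl")
        for cat in var.findall("catgry")
    }
    return {
        "name": var.get("name"),
        "label": var.findtext("labl"),
        "question": var.findtext("qstn/qstnLit"),
        "codes": codes,
    }


variable = read_variable(DDI_FRAGMENT)
```

The point of the markup is that the question text, the codes, and the labels each sit in their own tagged field, so software can pick them out without a person retyping the questionnaire.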

14 DDI Use Cases in Research Projects

15 Use Case: Midlife in the US
I will be discussing how DDI has benefitted the MIDUS study. To do that I want to begin with some background about the study to provide context for our application of DDI. MIDUS (Midlife in the U.S.) is a national longitudinal study of health and well-being begun in 1995. The study was conceived by a multidisciplinary team of scholars that wanted to examine aging as an integrated bio-psycho-social process. This and other unique features of MIDUS have resulted in a complex study design that is a rich data resource for researchers, but it also poses documentation and distribution challenges. We have found that DDI offers a rich structure or scaffolding around which this study can be described.

16 Use Case: Midlife in the US
Key characteristics of MIDUS:
Multiple longitudinal samples
Multidisciplinary design
Data products
  22+ datasets and growing
  25,000 variables
  N > 13,000
Wide secondary usage – Open Data philosophy
  Top data download at ICPSR
  95k data downloads; 48k users
  900+ publications
Multiple samples obtaining longitudinal data over relatively long intervals (around 9-10 years between assessments). I mentioned the multi-disciplinary design. MIDUS is actually comprised of 5 distinct data collection projects that collect survey, cognitive, daily diary, biomarker, and neuroscience data. These projects are managed by different sites across the country and each produces its own datasets. This approach has produced a large and growing amount of longitudinal data to distribute. To date: 22+ datasets of very different subject material, containing over 25k variables, on over 13,000 participants.
MIDUS is currently funded by the National Institute on Aging, and as an NIA-funded study, MIDUS takes its data sharing obligations seriously. Archiving and distributing MIDUS materials through the ICPSR archive at the University of Michigan has resulted in the widespread secondary usage of MIDUS data. Since 1999, MIDUS has been among the most popular studies at ICPSR: it has been downloaded over 95k times, 48k unique users have accessed the MIDUS materials at institutions worldwide, and over 900 publications have been generated using MIDUS data.
The popularity of MIDUS data with the research public requires a robust approach to documentation. With so much secondary usage of the data I want to bake as much information as I can into the documentation and do so in a standard way that can be understood by humans and machines. That goal is accomplished in large part by assembling richly structured metadata. Metadata convey information necessary to fully exploit the analytic potential of the data.
In the case of widely distributed or shared data, metadata become the de facto form of communication between secondary researchers and the original data producers.

17 Use Case: Midlife in the US
Particular benefits of DDI Lifecycle (3.2) for MIDUS:
Intelligent search function
  Searches different fields: variable name, label, question text, assigned concepts
  Search results are arrayed
  Intelligent searches across ALL 25k MIDUS variables
Harmonization (internal, post-hoc)
  Clarifies the related nature of versions of longitudinal and cross-cohort variables
Facilitates Custom Data Extracts
  Researchers can focus on variables of interest
  Facilitate accurate merges across numerous datasets
  Ease data management burden
The communication of research metadata is greatly enhanced by employing a metadata standard such as DDI to organize and distribute it. And the most recent version of DDI Lifecycle (3.2) is very beneficial for describing longitudinal studies. Using it and a DDI tool called Colectica, MIDUS is able to provide data users these benefits:
Intelligent Search: the Portal can search key words in specific metadata fields, across all MIDUS datasets, and array the search results. Intelligent search is crucial to secondary researchers who may not be intimately familiar with the 25,000 MIDUS variables.
Internal harmonization: A principal reason for conducting longitudinal research is to compare the same measures over time. Using DDI, the Portal can clarify which variables are related to each other and give data users a heads-up when those variables might not be strictly equivalent.
Finally, with the Portal able to search across 25k variables and to display information on related variables, we went a step further and introduced a custom data extract function. Instead of requiring users to download whole datasets, whittle them down to variables of interest, then perform multiple merges to combine the datasets, the MIDUS Portal has simplified (in one step, in one place) finding, accessing, and merging these variables into a dataset.
While DDI’s not in the business of delivering custom datasets per se, that function can be facilitated by marking up the metadata in a richly structured, machine-actionable manner using DDI.
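To make the search idea concrete, here is a minimal sketch of looking for a term across the four metadata fields named on the slide (variable name, label, question text, assigned concepts). It uses plain substring matching over made-up records; the record layout, the variable names VAR_A and VAR_B, and the matching rule are all assumptions for illustration, not a description of how the Colectica Portal actually searches.

```python
from dataclasses import dataclass, field


@dataclass
class VariableRecord:
    """Hypothetical, simplified variable-level metadata record."""
    name: str
    label: str
    question_text: str
    concepts: list = field(default_factory=list)


def search_variables(records, term):
    """Return (variable name, matching field names) for records containing the term."""
    needle = term.lower()
    hits = []
    for rec in records:
        searchable = {
            "name": rec.name,
            "label": rec.label,
            "question_text": rec.question_text,
            "concepts": " ".join(rec.concepts),
        }
        # Case-insensitive substring match, reported per field.
        matched = [f for f, text in searchable.items() if needle in text.lower()]
        if matched:
            hits.append((rec.name, matched))
    return hits


# Made-up catalog entries, not real MIDUS variables.
CATALOG = [
    VariableRecord("VAR_A", "Ever smoked cigarettes regularly",
                   "Have you ever smoked cigarettes regularly?", ["smoking"]),
    VariableRecord("VAR_B", "Self-rated physical health",
                   "In general, would you say your physical health is...", ["health"]),
]

results = search_variables(CATALOG, "smok")
```

Searching on the stem "smok" is a crude stand-in for matching variations of a search term; because each hit reports which fields matched, a front end could highlight them separately.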

18 The MIDUS Colectica Portal http://midus.colectica.org
Next slides are screen shots from the Portal to demonstrate these benefits: This is simply a shot of the Portal’s home page APPEAR The portal can be freely accessed after opening an account. If DDI interests you at all, I would encourage you to check out the URL, explore the Portal, and feel free to send me any questions or comments about it.

19 I’m going to start with the Search function because it’s probably the easiest entrée to the Portal, especially if a user is not familiar with the study. APPEAR: Here I’ve performed a search for variables related to smoking. The Portal searches metadata fields such as question text, variable labels, and even assigned concepts. It highlights in yellow the text found in these different fields, and you can see it looks for variations on the search term. APPEAR: The search results also show “breadcrumbs” for each variable: metadata about its source, the data file, instrument, and variable group. Let’s say I want to examine the variable C1PA41.

20 I can click on the variable name and explore the full variable-level metadata in detail, including
APPEAR Label, question text, interviewer instructions, skip patterns, which dataset it’s found in, harmonization information, and basic descriptive stats.

21 I can also manually explore individual projects by topic.
APPEAR: In this case, I want to look at survey variables related to history of smoking. Under the Explore tab, I see a concordance table listing all the different versions of smoking variables across SIX datasets. APPEAR: The concordance table shows where equivalent versions exist, but also where variables are absent. In this view, I can click on any individual variable and examine its metadata.
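A concordance table of this kind can be thought of as a grid of concepts by datasets, with a gap wherever a dataset has no equivalent variable. The sketch below builds such a grid from made-up inputs; the dataset names, variable names, and the variable-to-concept mapping are hypothetical, and real DDI metadata would carry these relationships in a richer form.

```python
def build_concordance(datasets):
    """Map each concept to the variable measuring it in every dataset.

    `datasets` maps a dataset name to a {variable name: concept} dictionary.
    A value of None marks a dataset where the concept was not measured.
    """
    concepts = sorted({c for variables in datasets.values() for c in variables.values()})
    table = {}
    for concept in concepts:
        row = {}
        for ds_name, variables in datasets.items():
            matches = [var for var, c in variables.items() if c == concept]
            row[ds_name] = matches[0] if matches else None
        table[concept] = row
    return table


# Hypothetical inputs: two waves, one concept missing from the first wave.
DATASETS = {
    "wave1": {"W1_SMOKE": "ever smoked"},
    "wave2": {"W2_SMOKE": "ever smoked", "W2_AGEQUIT": "age last smoked"},
}

concordance = build_concordance(DATASETS)
```

Keeping the absent cells as explicit None values is what lets a display show where a measure is missing rather than silently dropping the row.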

22 Or I can click on the concept/category
APPEAR: And can see histograms of equivalent variables across datasets; The Portal also reminds me of potential differences among these variables. In this example, comparability notes tell me this particular measure was introduced at later waves and is absent from the baseline dataset. A Visualization tool also provides a graph of the distributions of each variable and is very useful for quick comparisons.

23 The Portal can also toggle what type of visualization to display, depending on level of measurement
APPEAR: Here’s an example of a continuous numeric variable (Age last smoked cigarettes), which produces a box plot visualization of each version of the variable. If I want to share this particular view with someone, I can copy and paste the URL.

24 At a more atomistic level, the Portal provides detailed harmonization information about categories, codes, and labels used in different versions of a variable. In this example APPEAR: (self-rated physical health), we can see the comparability notes warn us that the scale was flipped at time 1. The table shows the codes and labels for each variable and confirms that the baseline measure was indeed reverse-coded. Finally, any time I see a shopping cart icon next to a variable, I can click it; the icon turns orange/gold to indicate the variable has been added to my variable basket.
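When a comparability note says a scale was flipped at one wave, a secondary analyst would typically reverse-code that wave before comparing it with the others. The sketch below shows the usual arithmetic, new = low + high - old. The 1 to 5 range is an assumption for illustration, since the transcript does not give the actual codes of the self-rated health item, and the sample values are made up.

```python
def reverse_code(value, low=1, high=5):
    """Flip a value on a low..high scale; missing values (None) pass through."""
    if value is None:
        return None
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the {low}-{high} scale")
    return low + high - value


# Made-up baseline responses on an assumed 1-5 scale, with one missing value.
baseline = [1, 2, 5, None, 3]
harmonized = [reverse_code(v) for v in baseline]
```

Raising an error on out-of-range codes is a deliberate choice here: reserved codes such as refusals should be handled explicitly rather than flipped along with valid answers.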

25 Under the Basket menu, the Portal gives me a few options:
I can assign a name to my custom dataset and download CSV or SPSS. I can download a PDF codebook of the variables I’ve selected. I can even generate the DDI XML file if I’m so inclined.

26 The variables that are in my basket, be they from different waves and different projects, are automatically merged into one customized dataset. Here is a variable view from SPSS. The Portal creates a wide-formatted (flat) dataset. It includes default administrative variables like Respondent and Sample Identifiers.
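The merge behind a basket like this amounts to joining the selected variables on a respondent identifier, producing one row per respondent and one column per variable. Here is a minimal sketch under stated assumptions: the identifier field name respondent_id, the dataset names, and the variable names are all hypothetical, and this is not the Portal's own implementation.

```python
def merge_wide(datasets, basket, id_field="respondent_id"):
    """Merge basket variables into one wide (flat) table keyed by respondent.

    `datasets` maps a dataset name to a list of row dictionaries.
    `basket` is a list of (dataset name, variable name) selections.
    """
    merged = {}
    for ds_name, var_name in basket:
        for row in datasets[ds_name]:
            rid = row[id_field]
            merged.setdefault(rid, {id_field: rid})[var_name] = row.get(var_name)
    columns = [id_field] + [var_name for _, var_name in basket]
    # Respondents absent from a dataset get None for that dataset's variables.
    return [{col: rec.get(col) for col in columns} for _, rec in sorted(merged.items())]


# Made-up data: respondent 2 did not take part in the second wave.
WAVES = {
    "wave1": [{"respondent_id": 1, "W1_HEALTH": 2}, {"respondent_id": 2, "W1_HEALTH": 4}],
    "wave2": [{"respondent_id": 1, "W2_HEALTH": 3}],
}
BASKET = [("wave1", "W1_HEALTH"), ("wave2", "W2_HEALTH")]

wide = merge_wide(WAVES, BASKET)
```

Keeping respondents who are missing from a later wave, with empty cells, mirrors the wide format described on the slide, where each wave's measure sits in its own column.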

27 I can also download a customized DDI codebook to accompany my custom dataset. It includes important information about versioning, provenance, harmonization, and other variable-level metadata that are not included in the customized datasets.

28 The MIDUS Colectica Portal http://midus.colectica.org

29 Use Case: CLOSER
Key strengths of CLOSER:
Multiple longitudinal samples
Multiple cohorts (1930 – present)
Biomedical & Social Science
Products:
  ~150,000 questions
  ~250,000 variables
  ~300 datasets
Metadata only platform
Full Questionnaire flow and contents
Cross-cohort comparison and harmonisation

30 Use Case: CLOSER - Scope

31 Use Case: CLOSER - Questions

32 Use Case: CLOSER - Data

33 A derived (composite) variable

34 Derived Variable has a lineage

35 Classification management

36 Platform agnostic description
Use Cases
Harmonisation
Common code base from same metadata
Platform independence
Reproducibility of outputs

37 DDI Use Case with a Data Repository

38 Use Case: ICPSR

39 Use Case: ICPSR
Key characteristics of ICPSR:
One of the world’s oldest and largest social & behavioral science data archives, established in 1962
760+ members around the world
Data dissemination for more than 20 federal and non-government sponsors
300,000+ unique Web visitors per year
10,000+ data collections

40 Use Case: ICPSR
Particular benefits of DDI-Codebook for ICPSR:
Archive driven by metadata standards:
  Information is consistently described
  Straightforward search and discovery
  The same information can be re-used in different ways
  Transportable information for use by different organizations
Study-level and variable-level metadata
DDI metadata drives site functionality

41 Study-level DDI Elements
Title, Alternate Title Study Number Principal Investigator Funding Bibliographic Citation Series Information Summary Subject Terms Geographic Coverage Time Period Date of Collection Unit of Observation Universe Data Type Sampling Weights Mode of Collection Response Rates Extent of Processing Restrictions Version History Time Method (e.g., longitudinal) Data Method (e.g., qualitative)

42 Study-level DDI leveraged in several ways
Search: Forms basis of ICPSR search
Repurposing: Record is reused across ICPSR’s topical archive sites
Interoperating: Records shared with other archives
Study Overview: Becomes PDF overview bundled with each download
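Sharing records with other archives usually means translating study-level fields into whatever standard the receiving system expects, for instance Dublin Core, which the deck mentions earlier as interoperable with DDI. The sketch below maps a few of the study-level fields listed on the previous slide onto Dublin Core element names. The crosswalk is an illustrative assumption, not an official DDI or ICPSR mapping, and the study record is made up.

```python
# Illustrative crosswalk from study-level field names to Dublin Core elements.
# This mapping is an assumption for the sketch, not an official crosswalk.
CROSSWALK = {
    "Title": "title",
    "Principal Investigator": "creator",
    "Summary": "description",
    "Subject Terms": "subject",
    "Geographic Coverage": "coverage",
    "Restrictions": "rights",
}


def to_dublin_core(study_record):
    """Keep only the fields that have a Dublin Core counterpart in CROSSWALK."""
    converted = {}
    for field_name, value in study_record.items():
        dc_element = CROSSWALK.get(field_name)
        if dc_element is not None:
            converted[dc_element] = value
    return converted


# Made-up study record; "Study Number" has no counterpart above and is dropped.
STUDY = {
    "Title": "Example Longitudinal Study",
    "Principal Investigator": "A. Researcher",
    "Study Number": "0000",
    "Subject Terms": ["aging", "health"],
}

dc_record = to_dublin_core(STUDY)
```

Because the source record is consistently structured, the same dictionary could feed a search index, a topical site, or a PDF overview without being re-entered, which is the reuse the slide describes.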

43 Variable-level DDI Elements
Variable group reference
Variable name and ID
Variable label
Descriptive variable text
Question text
Category label and value (responses)
Category statistics (frequencies)
Summary statistics
Notes

44 Variable-level DDI - leveraged in several ways
Search: Permits search of variables in a dataset
Search across ICPSR: Serves as foundation for Social Science Variables Database
Codebook with frequencies: Enables generation of PDF documentation
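A codebook with frequencies combines the category codes and labels held in the metadata with counts taken from the data. As a small sketch of that step, the function below tabulates responses against a code list; the code list reuses the life-satisfaction categories from the earlier example slide, and the responses are made up.

```python
from collections import Counter


def category_frequencies(values, code_labels):
    """Return (code, label, count) rows for every documented category."""
    counts = Counter(values)
    # Iterate over the documented codes so unused categories still appear with 0.
    return [(code, label, counts.get(code, 0)) for code, label in sorted(code_labels.items())]


CODES = {1: "A LOT", 2: "SOMEWHAT", 3: "A LITTLE", 4: "NOT AT ALL"}
RESPONSES = [1, 2, 2, 4, 1, 2]  # made-up data for illustration

freqs = category_frequencies(RESPONSES, CODES)
```

Driving the table from the documented codes rather than from the observed values is what keeps an unused category visible in the generated codebook.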

45 DDI Takeaways
Improve data’s reuse factor
Consistently document data using DDI
Reduction in manual processes
  Increases accuracy
  Reduces costs in time and money
One DDI document → multiple uses
Enabling distributed data collection and research processes
  Across different platforms and systems
  Between different organizations and researchers
Increased quality of documentation
  Raises visibility of needs and gaps
  Supports better understanding of data products and data collection processes
New tools easily built to address different problems across the research data lifecycle

46 DDI Website Learn how to get started with DDI:

47 Thank you! For more information, questions, ...
Barry Radler, Jared Lyle, Jon Johnson

48 Slides not used...

49 What DDI provides…
Capture what was intended
  What: what data were captured and why
Capture exactly what was used in the survey implementation
  How: the mode, logic employed and under what conditions
Specify what the data output will be
  That is, mirrors what was captured and its source
Keep the connection
  Between the survey implementation through to the data received -> data management by PIs -> to archiving
Generalised solution
  So that it can be actioned efficiently and is self-describing
  So that it can be rendered in different forms for different purposes
To do something with this, we need ……

50 …and a framework to do this
Data Cleaning, Labeling, and Transformations
Documentation, READMEs, Descriptions (non-dataset or variable)
Methodology and Instrument Design
Descriptive information for reuse and discovery
Instrument Fielding and Data Collection

