Published by Bryan Kirk. Modified over 10 years ago.
1
Workshop on Metadata Standards and Best Practices
November 19-20, 2007
Session 3: Researcher Metadata in RDCs
Pascal Heus, Open Data Foundation
pheus@opendatafoundation.org
http://www.opendatafoundation.org
2
Open Data Foundation – IZA 2007/11

Outline

RDC needs
Metadata in RDCs
Potential solutions
Examples
Conclusions / Q&A
3
RDC Overview

Provide an environment in which researchers can perform in-depth analysis of data as efficiently as possible
Simple access to a data file and a codebook is insufficient
High-quality metadata and a collaborative environment are needed to promote dynamic research
The environment should capture the research process
It should benefit all stakeholders: producers, librarians, researchers, the general public, etc.
4
Metadata and the survey life cycle

A survey is not a static process
It evolves dynamically over time and involves many players
It extends to aggregate data in order to reach decision makers
Metadata is crucial for capturing knowledge
5
Importance of metadata

Imagine a world without metadata. Users would say:
  – I can't find the right data! How do I get access?
  – Where is the report / questionnaire / methodology?
  – I don't understand this survey / file / variable
  – I can't merge the files
  – How do I weight the data?
  – My results don't match the report; I can't reproduce the same results
  – Are these things comparable?
  – I didn't know someone had done this research before!
Sound familiar?
  – Metadata is an answer to researchers' frustrations
Producers and archivists are working to improve metadata, but researchers must capture metadata as well (life cycle!)
6
When to capture metadata?

Metadata must be captured at the time the event occurs!
Documenting after the fact leads to considerable loss of information
This is true for both producers and researchers
7
Metadata and the replication standard

Replication standard
  – Gary King, Harvard, 1995, http://gking.harvard.edu/projects/repl.shtml
  – "The replication standard holds that sufficient information exists with which to understand, evaluate, and build upon a prior work if a third party can replicate the results without any additional information from the author."
  – The only way to understand and evaluate an empirical analysis fully is to know the exact process by which the data were generated
  – A replication dataset includes all information necessary to replicate the empirical results
Metadata is crucial to meeting the standard
  – It is composed of documentation and structured metadata
  – Undocumented data is useless
8
RDC issues

Without producer metadata
  – Researchers can't discover data or work efficiently
Without researcher metadata
  – Producers don't know about data usage and quality issues
  – Other researchers are not aware of what has already been done
Without standards
  – Information can't be properly managed or exchanged between agencies or with the public
Without tools
  – Knowledge can't be captured, preserved or shared
9
RDC Metadata Framework

[Diagram labels: Producers, RDC Data, Researcher, Producer/Archive Metadata, Research Metadata, Research Output, Public Use metadata, External users]

1. Producers provide data and basic documentation
2. Existing metadata needs to be enhanced
3. Start capturing researcher metadata
4. Knowledge grows and gets reused
5. Provides usage and quality feedback to the producer / RDC
6. Repeat across surveys and topics
7. Metadata facilitates output
8. Public metadata facilitates data discovery and fosters global knowledge
9. Metadata exchange between agencies
10
RDC Solutions

Metadata management
  – Adopt standards and provide researchers with comprehensive metadata
  – Use related tools to capture the research process
Collaborative environment
  – Use web technologies to foster a dynamic research environment
Connected and remote enclaves
  – Connect RDCs through secure networks
  – Consider a virtual data enclave
Data disclosure
  – Protect respondents through sound disclosure control techniques
Train providers and researchers
11
Simple techniques

Start with good practices
  – File and variable naming conventions (embed metadata)
  – Code documentation
  – Good statistical methods
Web tools
  – Take advantage of common web technologies
  – Organize: calendar, events and news, task/to-do lists
  – Knowledge capture and sharing: shared document/script libraries, wikis, blogs, discussion groups, citation databases, etc.
12
Coding and naming conventions (1)

Give meaningful names to files
  – Avoid spaces in names; don't use upper case
  – Version your files (capture progress)
  – Use middle extensions (e.g. Demohh1000.out.dta)
  – Include metadata in the name
Not too good:
  – report.doc, notes.txt
  – myfile.dta, table2.xls
  – reg.do, test.do, results.
Better:
  – usda_arms_2005_final_report_v200607.doc
  – usda_arms_results_v200706.dta, usda_farms_by_crop.xls
  – income_regression_v200706.do
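A file name that embeds metadata can be assembled from its parts rather than typed by hand, which keeps names consistent across a project. The following is a minimal Stata sketch, not a prescribed method: the local macros source, content and version are hypothetical names introduced here, and their values are taken from the example file name on this slide.

```stata
// Hypothetical naming components (illustrative values only)
local source  "usda_arms"    // producer and survey
local content "results"      // what the file holds
local version "v200706"      // version stamp (year and month)

// Assemble a lower-case, space-free, versioned file name
local outfile = lower("`source'_`content'_`version'.dta")
display "`outfile'"
```

The resulting name could then be passed to a command such as save, so that every saved file carries its source and version.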
13
Coding and naming conventions (2)

Give meaningful names to variables
  – Not too good: tmp3, ag_exp2, v324
  – Better: valid_enterprise, agricultural_expenditure, s1q3
Avoid complex code
Comments, comments, comments!
  – Include plenty of comments in your source code
  – Writing the code is the best time to capture knowledge!
  – Comments also promote replicability and will help you in a few months when you try to remember what you did
Share source code and use peer review
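In Stata, a descriptive name, a variable label and a time-stamped note can all be attached when the variable is created or renamed, which is one way to capture knowledge as the work happens. The sketch below uses toy data; the meaning assigned to v324 is an assumption made purely for illustration and does not come from any real survey.

```stata
clear
set obs 3
generate v324 = 100 * _n    // toy values standing in for a survey variable

// Replace the cryptic name with a descriptive one
rename v324 agricultural_expenditure

// Document the variable at the moment the change is made.
// The label text is hypothetical; TS is replaced by a time stamp.
label variable agricultural_expenditure "Agricultural expenditure (hypothetical)"
notes agricultural_expenditure: TS renamed from v324

describe
notes list
```

Because labels and notes are stored inside the .dta file, this documentation travels with the data when the file is shared.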
14
Not so good code example

local mypath = "c:\data\anonymization"
global data_in = "`mypath'" + "\" + "Demohh1000.dta"
global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
global threshold = 0.8
cd "`mypath'"
set more off
use "$data_in", clear
tempfile temp
gen fk=1
gen wi=weight
collapse (sum) fk wi, by(town province marstat sex age)
gen pk=fk/wi
gen qk=1-pk
gen rk= (pk/qk) * log(1/pk) if fk==1
replace rk= (pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
replace rk=(pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
#delimit ;
replace rk= (pk/fk) * (1+ (qk/(fk+1))
  + ((2*qk^2) / ((fk+1)*(fk+2)))
  + ((6*qk^3) / ((fk+1)*(fk+2)*(fk+3)))
  + ((24*qk^4) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)))
  + ((120*qk^5) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)))
  + ((720*qk^6) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)))
  + ((5040*qk^7) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)*(fk+7))))
  if fk>3 ;
15
Better code example

/**
 * Computes the disclosure risk at individual level
 *
 * @author John Anonymous (janon@example.org)
 * @version 2007.06
 * References:
 * - micro-Argus 4.1 manual, p27-25
 */

// Configuration
local mypath = "C:\data\anonymization"
global data_in = "`mypath'" + "\" + "Demohh1000.dta"
global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
global threshold = 0.8

// Initialize
cd "`mypath'"
set more off

// Load the data
use "$data_in", clear
tempfile temp
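The same commenting style can be carried into the computation itself. The sketch below applies it to the first three risk formulas from the "not so good" example, run on a small toy dataset so that it is self-contained. The toy records, the reduced set of key variables (town, sex, age) and the wording of the comments are assumptions made for illustration; the formulas are copied unchanged from the slide.

```stata
// Toy microdata: three key variables and a sampling weight (made-up values)
clear
input town sex age weight
1 1 30 12.5
1 1 30  7.5
1 2 45 20
2 1 30 10
2 1 30 10
2 1 30  5
end

// Count records (fk) and sum weights (wi) per key combination
generate fk = 1
generate wi = weight
collapse (sum) fk wi, by(town sex age)

// pk: ratio of the sample count to the weighted count; qk: its complement
generate pk = fk / wi
generate qk = 1 - pk

// Individual risk rk, one closed-form expression per sample count
generate rk = (pk/qk) * log(1/pk) if fk == 1
replace rk = (pk/(qk^2)) * ((pk*log(pk)) + qk) if fk == 2
replace rk = (pk/(2*(qk^3))) * (qk*(3*qk - 2) - (2*pk^2)*log(pk)) if fk == 3

list town sex age fk wi pk rk
```

On this toy data each key combination ends up with a risk between 0 and 1, and the combination with a single record gets the highest value.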
16
Canada RDC Project

Consists of 14 Research Data Centres, 6 branch RDCs and the Federal Research Data Centre in Ottawa
Data is provided by Statistics Canada
The RDCs are now connected through a high-speed secure network
The project aims to adopt a DDI 3.0 based metadata framework for survey documentation and research work, and to sponsor the development of tools
ODaF is providing technical assistance
http://www.statcan.ca/english/rdc/index.htm
17
The Canada RDC Research Life Cycle [Chuck Humphrey, University of Alberta]

Managing data stages
Stages in the life cycle:
  – Project Application
  – Project Approval
  – Project Creation
  – Access to Data
  – Generate Analysis Files
  – Output Disclosure Analysis
  – Research Communications
18
Metadata in Canada RDC

[Diagram labels: RDC; Producer, Analyst, Researcher; Original Survey, Master Survey, Virtual Survey, Research Output; Security; Other researchers, Policy Makers, General Public, …; Publication, Conferences, …]

1. Producer makes the survey available
2. Analyst packages it for the RDC
3. Researcher gets access and reshapes the data
4. Researcher performs complex analysis
5. Researcher publishes results
6. Information flowing in and out, and activities, are controlled and monitored
7. Outside users get access to the research output
8. Analyst includes results, activity and feedback, and reports to the producer

The information flow relies on metadata and also generates new information that must be captured!
19
Metadata Framework in Canada RDC

[Diagram labels: Original Survey, Master Survey, Virtual Survey, Research Output, Publication, Conferences, …; Repurpose, Disclosure, Tables, Other, Version, Log, Group; Metadata Management, Virtual File System; Storage, Query, Registry, Exchange, Data, Files; Security, Authorization, Authentication, i18n; Analysis, Report, Metadata Mining, Compare; 2.0 Editor, Question, Quality, Concepts, Resources; Legacy, SPSS, SAS, Stata, 2.0 / 3.0, DDI 3.0; Project, Admin, Audit, Logs; Communication, Collaborative Intranet, Training, Documentation]
20
NORC Data Enclave

The National Opinion Research Center provides a secure environment within which authorized researchers can access sensitive microdata remotely from their offices or onsite
Data come from the National Institute of Standards and Technology's (NIST) Technology Innovation Program (TIP), the Ewing Marion Kauffman Foundation, and the Economic Research Service at the US Department of Agriculture
Possibly the first virtual data enclave
http://dataenclave.norc.org
21
NORC Virtual Enclave
22
Benefits (1)

Data documentation
  – Through good metadata practices, comprehensive documentation is available to researchers
Preservation, integration and sharing of knowledge
  – The research process is captured and preserved in a harmonized format
  – Research knowledge becomes an integral part of the survey and is available to others
  – The producer gets feedback from data users (usage, quality issues)
  – Reduces duplication of effort and facilitates reuse
23
Benefits (2)

Research outputs and dissemination
  – Facilitates production of research outputs
  – Facilitates dissemination and fosters broader visibility of research outputs
Exchange of information
  – Metadata exchange between RDCs, producers and librarians
  – Importance of public metadata for sensitive datasets
  – Facilitates data discovery (inside and outside the RDC)
24
Conclusions

Metadata plays a crucial role in RDCs
It benefits all stakeholders
  – Better use of the data (return on investment)
  – Improved research quality
  – Fosters production of high-quality data (more relevant and accurate) accompanied by comprehensive metadata
Adopting good practices may mean changing the way you work
  – This requires good change management techniques and discipline
  – But the benefits are worth the effort