John Porter, Sheng Shan Lu, M. Gastil Gastil-Buhl
With special thanks to Chau-Chin Lin and Chi-Wen Hsaio
Maximizing the potential of LTER data to be used to make new ecological discoveries
Moving from the era of single datasets to large-scale data integration (tens to hundreds of datasets)
A first step to achieving this goal is to automate the mechanical processes associated with data ingestion into analytical software.
We want to:
- Identify a dataset in the LTER Network Information System
- Download it
- Write an R statistical program to read the data
- Produce basic statistical summaries of the ingested data
How long should that process take? With our tools we can do that in less than 1 minute!
Tool: TFRI – R module
  Description: Web-form-based system that takes you through a multistep process to ingest data, do a basic quality assurance analysis and run simple analyses
  Data download: Manual
Tool: StatProg
  Description: Web-form-based system that generates R, SAS, SPSS or Matlab programs that can be edited to process data
  Data download: Manual
Tool: PASTAprog
  Description: Web service that returns a ready-to-use R, SAS or Matlab program. Can be run directly from inside R for 1-minute analyses!
  Works with Metacat: Variable (some automated, some manual)
  Works with PASTA: Fully automated download
Note: You do NOT need to have R installed on your PC to use this. It is entirely web-based. Don’t be worried by the buttons! A fully English version is available at the URL above
Metadata Display
Statistical Functions
Raw Data Upload
Select the number type of the field
Include the field in the R code (select at least one)
EML metadata is transformed into HTML by an XSL stylesheet
No field header Upload
Data Check Functions (only for numerical attributes!)
- Correct domain (real, integer)
- Range checks
Action Options:
- Edit records with bad values
- Set all the bad values to missing (NA)
- Eliminate all the records with bad values
- Ignore all the range check problems (only for value range errors)
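The non-interactive action options above can be illustrated with a short sketch. This is not the code behind the TFRI web form; it is a minimal Python illustration, assuming records are held as a list of dictionaries and that missing values are stored as `None`. The function name `apply_range_check` and the action labels are hypothetical. The "edit records" option is interactive and is left out.

```python
def apply_range_check(records, field, low, high, action="set_missing"):
    """Apply one range-check action to a numeric field.

    records: list of dicts, one per data record (assumed layout).
    action:  "set_missing" -> replace out-of-range values with None (NA)
             "eliminate"   -> drop every record with an out-of-range value
             "ignore"      -> return the records unchanged
    """
    def is_bad(record):
        value = record[field]
        # Missing values are not range errors; only real values are checked.
        return value is not None and not (low <= value <= high)

    if action == "ignore":
        return [dict(r) for r in records]
    if action == "set_missing":
        return [{**r, field: None} if is_bad(r) else dict(r) for r in records]
    if action == "eliminate":
        return [dict(r) for r in records if not is_bad(r)]
    raise ValueError(f"unknown action: {action!r}")


# Illustrative data only: -999.0 stands in for an out-of-range value.
records = [
    {"site": "A", "temp": 12.5},
    {"site": "B", "temp": -999.0},
    {"site": "C", "temp": None},
]
cleaned = apply_range_check(records, "temp", -50, 60, action="set_missing")
```

Setting bad values to missing keeps the record count unchanged, which matters when the same records are later merged with other datasets; eliminating records is simpler but shortens the table.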
Data Type Error:
Value Range Error:
Select the 'Set all the bad values to missing (NA)' option
3 Update
The message for No data error
This line cannot be modified.
The rest of the R program CAN be modified to reflect your analyses.
Select program type
Specify the metadata document to use
You can get the Package ID from the LTER Metadata catalog. Download a copy of the data while you are there!
Or, you can specify a metadata document on a site server by giving the full URL.
Importantly, you need to edit the program to point to where the data is stored on YOUR computer, so the program can find it!
The previous form-based programs have been available for several years.
Their performance has improved as metadata has gotten better.
But they can still be slower to use than we would like, requiring manual editing and extra steps.
The advent of the LTER PASTA system makes truly automated ingestion and analysis possible using a web service.
R “source” function, specifying the web service URL and that we want to “echo” our commands to the screen
Package ID from the PASTA Data Portal
DONE! Our analysis has been run, and basic statistical summaries have been created for each of the attributes.
You can now add additional commands to generate graphics, etc., or to merge with other datasets.
Base URL:
Plus: a Package ID (available on the PASTA portal), e.g., knb-lter-vcr
  Scope: knb-lter-vcr
  ID: 26
  Revision: 14
Plus: a suffix indicating the type of program you want (e.g., .r, .sas, .spss, .m for R, SAS, SPSS or Matlab)
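The three URL parts listed here can be assembled mechanically. The sketch below is a Python illustration only: the base URL is not given in this text, so `base_url` is a placeholder argument, and joining the parts as `scope.id.revision.suffix` appended directly to the base is an assumption about the service's URL layout. The function name `program_url` is hypothetical.

```python
# Program types named on the slide: R, SAS, SPSS and Matlab.
ALLOWED_SUFFIXES = {"r", "sas", "spss", "m"}


def program_url(base_url, scope, identifier, revision, suffix):
    """Build a web service URL from a base URL, a package ID and a suffix.

    base_url is a placeholder: substitute the real service address.
    The dotted scope.id.revision.suffix layout is an assumption.
    """
    suffix = suffix.lstrip(".").lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported program type: {suffix!r}")
    package_id = f"{scope}.{identifier}.{revision}"
    return f"{base_url}{package_id}.{suffix}"


# example.org is a stand-in, not the real service address.
url = program_url("https://example.org/pastaprog?id=", "knb-lter-vcr", 26, 14, ".r")
```

Validating the suffix up front gives an immediate error for an unsupported program type instead of a failed request later.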
knb-lter-vcr r
You can also use the web service URL in a web browser to get a text copy of your program.
Note: There are other options that let you use the web service for data OUTSIDE PASTA by specifying the URL of the EML metadata separately.
Problems with Metadata
- Lead to a lack of congruency between the description of the data and the data itself*
- Bad practices in metadata, e.g., using special characters, spaces or mathematical operations as part of the attribute names
- Links to data in the metadata may not lead directly to the data*
Problems with Data
- Inconsistent coding (character data where numbers are expected), which causes conversion of numerical data into R “factors”
- Dates are often handled in different ways
- ????? – these systems need additional testing on a wide array of data, and you can help!
* Much improved by the PASTA system over the earlier Metacat
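Two of the problems listed above can be screened for before a generated program is run. The following Python sketch is only an illustration, not part of any of the tools described: `clean_attribute_name` rewrites attribute names containing spaces, special characters or operators into plain identifiers, and `non_numeric_values` lists the entries that would stop a column from being read as numeric. Both function names, the `x_` prefix and the default missing-value codes are assumptions chosen for the example.

```python
import re


def clean_attribute_name(name):
    """Replace runs of characters other than letters, digits and
    underscores with a single underscore, and trim edge underscores."""
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip()).strip("_")
    # Names that are empty or start with a digit get a prefix (assumed rule).
    if not cleaned or cleaned[0].isdigit():
        cleaned = "x_" + cleaned
    return cleaned


def non_numeric_values(values, missing_codes=("", "NA")):
    """Return the entries that cannot be parsed as numbers.

    Entries matching missing_codes are treated as missing, not as errors.
    """
    bad = []
    for value in values:
        text = str(value).strip()
        if text in missing_codes:
            continue
        try:
            float(text)
        except ValueError:
            bad.append(value)
    return bad


names = [clean_attribute_name(n) for n in ["Temp (deg C)", "N/P ratio", "2nd_reading"]]
suspect = non_numeric_values(["1.5", "2", "trace", "NA", ""])
```

Listing the offending values, rather than silently coercing them, makes it easier to decide whether a stray code such as a text flag should become a missing value or be corrected at the source.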