The implementation of a more efficient way of collecting data. Prof. Ene-Margit Tiit, Senior Methodologist, Methodology Department; Heli Jaago, Leading Methodologist, Methodology Department
Our general aims: development of estimation methods for SBS investment and financial leasing variables; elaboration of the target for data load programs (Extract, Transform, and Load process for data warehouses, ETL); creation and development of warehouse metadata (repository) and a normalized data model for data collection from businesses. 3/12/18
Project overview (1): analysis and technical specifications of administrative data sources; general description of the Metadata Framework (data describing data); data warehouse dictionary (metadata); normalized data model as the main structure for the administrative data repository; data load programs (Extract, Transform, and Load process for data warehouses, ETL).
Project overview (2): Improvement of the existing estimation methods for the production of the SBS investment and financial leasing variables. Creation of statistical models for estimating the distribution of investments in small enterprises (with fewer than 20 persons employed). Increased use of administrative data for small enterprises with the help of the statistical models created. As a result, the need for collecting empirical data will decrease.
The project is divided into two sub-projects: development of estimation methods for SBS investment and financial leasing variables (project leader: Prof. Ene-Margit Tiit); creation of a metadata warehouse (repository) of collected (meta)data (project leader: Leading Methodologist Heli Jaago).
Prof. Ene-Margit Tiit. Main target of the project: to develop the estimation methods for the SBS investment and financial leasing variables, i.e. to create statistical models for the breakdown of investments by kind of fixed assets. In the project we consider very small (fewer than 10 persons employed) and small (10–19 persons employed) enterprises in the areas of manufacturing, construction, trade and real estate.
Data. To solve the task we have a sample gathered in the years 2000–2006, 18 235 enterprises in total (about 3000 per year). In general, each enterprise in the sample represents 4–5 enterprises of the total population. Information about their gross investments (E15000) can be obtained from administrative sources.
The components of investments to be estimated are the following: E15120, gross investment in land; E15130, gross investment in existing buildings and structures; E15140, gross investment in construction and alteration of buildings; E15150, gross investment in machinery and equipment; E15440, gross investment in intangible goods.
To estimate the structure of investment, the following ratios were calculated for each unit of the sample: y1 = E15120/E15000, y2 = E15130/E15000, y3 = E15140/E15000, y4 = E15150/E15000, y5 = E15440/E15000, where 0 ≤ yi ≤ 1 for i = 1, ..., 5 (1) and y1 + y2 + y3 + y4 + y5 = 1 (2). The task is to create a model for the 5-dimensional vector Y = (y1, y2, y3, y4, y5).
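The ratio definitions and conditions (1) and (2) can be illustrated with a minimal Python sketch. The record layout (a dictionary keyed by variable code) and the example amounts are assumptions made for illustration only; they are not taken from the project data.

```python
# Variable codes of the five investment components, in the order y1..y5.
COMPONENTS = ["E15120", "E15130", "E15140", "E15150", "E15440"]

def investment_ratios(record):
    """Return (y1, ..., y5): each component divided by gross investment E15000."""
    total = record["E15000"]
    if total <= 0:
        raise ValueError("E15000 must be positive to form ratios")
    return tuple(record[code] / total for code in COMPONENTS)

# Hypothetical enterprise: invests only in construction and machinery.
example = {"E15000": 200.0, "E15120": 0.0, "E15130": 0.0,
           "E15140": 50.0, "E15150": 150.0, "E15440": 0.0}
y = investment_ratios(example)
print(y)
```

Condition (2) holds whenever the five components add up to E15000, which is what the sketch assumes of the input record.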
Grouping the data. The enterprises are grouped into 8 presumably homogeneous groups.
                 Size of enterprise
                 <10     10–19
Manufacturing    2883    2086
Construction      983     889
Trade            4683    2346
Real estate      4021     344
In each group the structure of investments is somewhat different, so the models may have different parameters in different groups.
Possible types of model. In principle, there are two main ways to create a model. Regression-type or prognostic models: the values of the predictable variable are calculated from known values of explanatory or background variables. Simulation models: the model consists of random numbers having the same distribution as the predictable variable or vector.
Choice between the types of model. Regression-type models are usable only when statistical dependencies exist between measurable explanatory variables and the predictable variables. In this task all dependencies between the investment ratios and the background variables (year, size of investment, number of employees, etc.) were weak, and the description rates (R²) of the models were quite small; see the following figure.
The average description rate of the regression models for the investment ratios was 6.1%.
Creating a simulation model. As the description rate of the regression-type models is too small, it is rational to use simulation models. That is, it is necessary to create a series of 5-dimensional random vectors Y = (y1, y2, y3, y4, y5) whose distribution is similar to the empirical distribution of the sample. The starting point is studying the structure of investments in the sample.
Structure of investments in an enterprise. It became evident that in most cases small enterprises concentrate their investments in one sphere; that is, in many cases most components of the investment vector Y equal zero. Of all 31 possible combinations (5 single-component structures, 10 two-component structures, 10 three-component structures, 5 four-component structures and one structure with all 5 components non-zero), one-component structures, where the only non-zero ratio equals one, made up about ¾ of the cases; see the following table.
The most frequent combinations of investments
Combination                          Share (%)   Cumulative (%)
Machinery                              68.72        68.72
Construction, machinery                 9.08        77.80
Buildings, machinery                    3.66        81.46
Land, machinery                         3.04        84.50
Construction                            2.77        87.27
Machinery, intangible goods             2.36        89.63
Buildings                               1.73        91.36
Land, construction, machinery           1.69        93.05
Land                                    1.44        94.49
Land, buildings, machinery              1.25        95.74
Buildings, construction, machinery      1.03        96.77
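The count of 31 possible structures, and the way a ratio vector is mapped to its structure, can be checked with a short Python sketch. The component names and the two helper functions are hypothetical and serve only as an illustration.

```python
from itertools import combinations

# Short names for the components y1..y5 (hypothetical labels).
NAMES = ["land", "buildings", "construction", "machinery", "intangible goods"]

def all_structures(names=NAMES):
    """Every non-empty subset of components, i.e. every zero/non-zero pattern."""
    return [combo for k in range(1, len(names) + 1)
            for combo in combinations(names, k)]

def structure_of(y, names=NAMES):
    """Names of the non-zero components of a ratio vector y."""
    return tuple(name for name, value in zip(names, y) if value > 0)

structures = all_structures()
# Number of structures with 1, 2, 3, 4 and 5 non-zero components.
sizes = [sum(1 for s in structures if len(s) == k) for k in range(1, 6)]
print(len(structures), sizes)
print(structure_of((0.0, 0.0, 0.25, 0.75, 0.0)))
```

Tabulating `structure_of` over all sample units would give the kind of frequency table shown above.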
Modeling the structure of investments. The simulation consists of two modelling steps. Step 1: the structure of the random vector (which components are non-zero) is modelled using the multinomial distribution with empirical probabilities. Step 2: the component values are set. If the vector contains only one non-zero component, that component equals 1 and the others equal 0. If the vector contains several non-zero components, their values are simulated (using the Normal, Beta or Uniform distribution), checking that conditions (1) and (2) are fulfilled. The parameters of these distributions are estimated from the sample data.
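The two steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the model used in the project: the structure probabilities are the shares of the most frequent combinations from the table (renormalized), the Beta parameters `alpha` and `beta` are hypothetical rather than estimated from the sample, and conditions (1) and (2) are enforced by one possible scheme, namely drawing all but one component and accepting the draw only if the remainder is positive.

```python
import random

NAMES = ["land", "buildings", "construction", "machinery", "intangible goods"]

# Shares (%) of the most frequent structures, as listed in the table.
# They cover 96.77% of the sample; random.choices renormalizes the weights.
STRUCTURE_SHARES = {
    ("machinery",): 68.72,
    ("construction", "machinery"): 9.08,
    ("buildings", "machinery"): 3.66,
    ("land", "machinery"): 3.04,
    ("construction",): 2.77,
    ("machinery", "intangible goods"): 2.36,
    ("buildings",): 1.73,
    ("land", "construction", "machinery"): 1.69,
    ("land",): 1.44,
    ("land", "buildings", "machinery"): 1.25,
    ("buildings", "construction", "machinery"): 1.03,
}

def simulate_vector(rng, alpha=2.0, beta=2.0):
    """Draw one vector Y = (y1, ..., y5) in two steps.

    alpha and beta are hypothetical Beta parameters, not sample estimates.
    """
    # Step 1: draw the structure (one multinomial trial, empirical probabilities).
    structures = list(STRUCTURE_SHARES)
    weights = list(STRUCTURE_SHARES.values())
    chosen = rng.choices(structures, weights=weights, k=1)[0]

    # Step 2: set the values of the non-zero components.
    if len(chosen) == 1:
        values = [1.0]
    else:
        while True:
            # Draw all but the last component; the last one is the remainder.
            draws = [rng.betavariate(alpha, beta) for _ in chosen[:-1]]
            remainder = 1.0 - sum(draws)
            # Accept only if every non-zero component is strictly positive,
            # which together with the remainder rule gives (1) and (2).
            if remainder > 0.0 and all(d > 0.0 for d in draws):
                values = draws + [remainder]
                break

    by_name = dict(zip(chosen, values))
    return tuple(by_name.get(name, 0.0) for name in NAMES)

rng = random.Random(2007)  # fixed seed so the run is reproducible
simulated = [simulate_vector(rng) for _ in range(1000)]
single = sum(1 for y in simulated if max(y) == 1.0) / len(simulated)
print(f"share of one-component vectors: {single:.3f}")
```

Treating the last component as the remainder makes its distribution differ from that of the other components, so a real implementation would need component-specific parameters estimated from the sample.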
Checking the model. The model was checked using data from 2007. The distributions of the simulated and empirical data were compared, and the results were satisfactory.
Heli Jaago (Leading Methodologist). Once more: what is metadata? Metadata is data about data. Concepts. Definitions. Data Processing rules. References. Classifications (description of structure, versions). XML-based ontologies (XBRL, SDMX, HL7, etc.). Workflow descriptions. Data model specifications. Information system specifications.
Metadata without metadata models
META-METAMODEL: allows the metamodel to be described. METAMODEL: describes the data model objects and the relationships between them. DATA MODEL: describes the data and the links between them. DATA: describe real-world objects.
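The four levels can be illustrated with a small Python sketch in which each level is declared using only the vocabulary of the level above it. All names here (`ObjectType`, `Table`, `has_column`, the `investment` table) are hypothetical and are not taken from MMX or from the project's data model.

```python
# Meta-metamodel: the vocabulary available for defining a metamodel.
META_METAMODEL = {"ObjectType", "RelationType"}

# Metamodel: kinds of data model objects and relationships,
# each typed by a meta-metamodel concept.
METAMODEL = {
    "Table": "ObjectType",
    "Column": "ObjectType",
    "has_column": "RelationType",
}

# Data model: concrete tables, columns and links, each typed by the metamodel.
DATA_MODEL = {
    "investment": "Table",
    "E15000": "Column",
    ("investment", "E15000"): "has_column",
}

# Data: values stored in the structures that the data model declares.
DATA = {"investment": [{"E15000": 200.0}]}

def conforms(level, level_above):
    """True if every element of `level` is typed by something in `level_above`."""
    return all(kind in level_above for kind in level.values())

def data_conforms(data, data_model):
    """True if every table and column used in the data is declared in the model."""
    return all(
        data_model.get(table) == "Table"
        and all((table, col) in data_model for row in rows for col in row)
        for table, rows in data.items()
    )

print(conforms(METAMODEL, META_METAMODEL))
print(conforms(DATA_MODEL, METAMODEL))
print(data_conforms(DATA, DATA_MODEL))
```

Because each level is ordinary data, a new metamodel can be added without changing the checking functions, which is the property a metadata-driven application relies on.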
MMX Framework: architectural solution for metadata. Interpretation of Statistics Estonia. Neuchâtel model: allows the description of statistical activities. Description of the survey sample, time period, collected variables, statistical indicators. Survey data: number, values, decimal points, ...
Neuchâtel model variable system. Key words: subject area; concept family, conceptual variable; statistical activity, statistical activity instance; statistical unit type; statistical characteristic; variable (contextual variable); measurement unit type.
User interface: MMX Metadata Navigator (video)
Related objects in Metadata Navigator (video)
MMX Metadata Navigator. Key points: completely metadata driven; the metamodel is not hard-coded into the application; new metamodels (classifications, workflows, etc.) require no changes in the application; always in the context of a single metadata object; the full context of the object (details, properties, relations) is visible; simple navigation, with links to any related object via a single click.
Why a metadata repository and not a wiki? Structured, not based on free text: formalized, can be queried. Constraints and business rules can be captured. Associations carry rich semantics and are not merely navigational links. Can be linked to other systems (SQL for database apps, RDF etc. for semantic web apps).
What is MMX? Data layer (data model, database objects and APIs). Metamodels, and a methodology for creating and implementing them in the data layer. Technological stack (ORM, application server, AJAX, ...) for creating Web applications based on the data layer and metamodels. Experience in creating and deploying such applications in specific customer environments.