Ioannis Xirouchakis / Unit B3

Presentation transcript:

Analysis of aggregations on-the-fly. Joint meeting of the ESS.VIP.BUS ICT Project and the ESS.VIP IT TF, Luxembourg, 3 March 2014. Ioannis Xirouchakis / Unit B3

Completed activities: analysis of on-the-fly aggregation of statistics; design of scenarios and tests for extending the ICT Hub to support on-the-fly aggregations; testing of the scenarios; preliminary analysis of results; proposal outlining the Enhanced Hub (as an extension to the Census and ICT Hubs).

Aggregation on-the-fly. Under the Hub logic, each country node contains data relevant solely to that country, so aggregates would need to be computed centrally. As data are not static (countries can modify country data at any time), aggregates would need to be computed on-the-fly, on an ad hoc basis, upon receipt of a relevant end-user query.

An EU28 aggregate on the Hub. Upon receipt of an end-user query, the Hub: dispatches 28 requests to country nodes; collects 28 (?) answers; calculates the aggregate (derivative quantity); presents the result to the end user. Challenges: the calculation depends on country node availability; additional information may be required; the execution of complex calculations may be necessary.

Scenario 1: Available countries only. Algorithm, upon receipt of an end-user request: the Hub sends N (e.g. 28) data requests; M responses are received within a given timeout; the Hub calculates the aggregate for M countries. Main challenges: Are these M countries sufficient to provide an EU28 aggregate? What if the next end-user request receives a different result (M* countries now available)? Methodological treatment of a temporarily unavailable country node / result.
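The Scenario 1 algorithm can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch` is a hypothetical callable standing in for a data request to a country node, the aggregate is taken to be a plain sum, and the 20-second default mirrors the indicative timeout mentioned in the performance slides. It is not the Hub's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def aggregate_available(nodes, fetch, timeout_s=20.0):
    """Send one request per node and aggregate whatever arrives in time.

    `fetch(node)` is a hypothetical function returning a numeric value
    for that node, or raising an exception if the node is unreachable.
    Returns the aggregate (assumed here to be a sum) and the list of the
    M nodes that actually contributed.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(nodes)))
    futures = {pool.submit(fetch, node): node for node in nodes}
    # Wait for all N requests, but no longer than the timeout.
    done, _not_done = wait(futures, timeout=timeout_s)
    # Do not block on late nodes; drop requests that have not started
    # (cancel_futures requires Python 3.9 or later).
    pool.shutdown(wait=False, cancel_futures=True)

    responses = {}
    for future in done:
        if future.exception() is None:  # skip nodes that failed
            responses[futures[future]] = future.result()
    return sum(responses.values()), sorted(responses)
```

The second return value makes explicit which M countries contributed, which is the information needed to decide whether the result is methodologically acceptable as an EU28 aggregate.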

Scenario 2: Periodic data pulling. Algorithm, upon receipt of an end-user request: the Hub receives M responses for N requests; the local copy is used for unavailable countries; the Hub calculates the aggregate for all N countries. Rationale: the aggregate is calculated over N countries. Functionality required vs. Scenario 1: periodic offline data pulling; preservation of the latest available data version centrally.
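A minimal sketch of the Scenario 2 fallback is given below. It assumes that live responses and the centrally stored copy are both simple mappings from node name to a numeric value, and that the aggregate is a sum; the function and argument names are hypothetical.

```python
def aggregate_with_local_copy(nodes, responses, local_copy):
    """Aggregate over all N nodes, falling back to the local copy.

    `responses` holds the M live answers (node -> value); `local_copy`
    holds the values preserved by the periodic offline pull. Both
    structures are assumptions made for this illustration.
    Returns the aggregate and the nodes served from the local copy.
    """
    values = {}
    from_local = []
    for node in nodes:
        if node in responses:
            values[node] = responses[node]      # fresh on-the-fly answer
        elif node in local_copy:
            values[node] = local_copy[node]     # node unavailable: use stored copy
            from_local.append(node)
        else:
            raise KeyError(f"no live or stored data for node {node!r}")
    return sum(values.values()), from_local
```

Because every node is covered by either a live answer or the stored copy, the aggregate is always computed over all N countries, at the cost of possibly using older data for the nodes listed in `from_local`.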

Scenario 3: Data versioning 1/2. Algorithm, upon receipt of an end-user request: the Hub sends N (=28) data version inquiries; for the A (=20) inquiries answered, the Hub compares the data version with the local copy; for the S (=17) found with the same data version, the Hub uses the local copy; for all other P (=3), the Hub sends data requests and receives M (=2) responses; the local copy is also used for unanswered data version inquiries and data requests (=8+1). In total, the local copy is used for N-A+S+P-M = 26 nodes.
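The bookkeeping in this slide can be checked with a short sketch. Node names and version numbers below are hypothetical; only the counts N=28, A=20, S=17, P=3 and M=2 come from the slide. Since every answered inquiry is either unchanged or changed (A = S + P), the expression N-A+S+P-M reduces to N-M.

```python
def plan_scenario3(local_versions, answered_versions):
    """Split nodes into those served locally and those needing a data request.

    `local_versions` maps every node to the version held centrally;
    `answered_versions` maps only the nodes that answered the version
    inquiry to the version they reported.
    """
    use_local, to_request = [], []
    for node, local_v in local_versions.items():
        remote_v = answered_versions.get(node)  # None: inquiry unanswered
        if remote_v is None or remote_v == local_v:
            use_local.append(node)
        else:
            to_request.append(node)
    return use_local, to_request


def local_copy_count(n, a, s, p, m):
    """Nodes served from the local copy: unanswered inquiries (n - a),
    unchanged nodes (s), and changed nodes whose data request failed (p - m)."""
    return (n - a) + s + (p - m)


# Reproduce the slide's example with hypothetical node names and versions.
nodes = [f"node{i:02d}" for i in range(28)]
local = {node: 1 for node in nodes}
answered = {node: 1 for node in nodes[:17]}           # S = 17 unchanged
answered.update({node: 2 for node in nodes[17:20]})   # P = 3 changed
use_local, to_request = plan_scenario3(local, answered)
print(len(use_local), len(to_request), local_copy_count(28, 20, 17, 3, 2))
```

After the version inquiry, 25 nodes are served locally (17 unchanged plus 8 unanswered); one of the 3 data requests then fails, which brings the total to 26.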

Scenario 3: Data versioning 2/2. Rationale: as few data as possible are pulled on-the-fly; no data are pulled if they already exist. Functionality required vs. Scenario 2: consistent data versioning; version inquiry.

Scenario 4: Pre-calculated aggregate. Extension to the algorithm of Scenario 3: when, for all data version inquiries answered (A), all data versions are up-to-date (S=A), then no data requests need to be sent and a pre-calculated aggregate could be used. Rationale: no complex calculations on-the-fly. Functionality required: offline calculations after each data pull.
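The S=A condition can be expressed as a small check. The sketch below assumes versions are comparable values keyed by a hypothetical node name, and that `None` signals a fall-back to the Scenario 3 path; neither detail comes from the slides.

```python
def can_use_precalculated(local_versions, answered_versions):
    """True when every answered version inquiry matches the local copy (S == A).

    Unanswered inquiries are ignored, as in the slide. Note that if no
    node answers at all, the condition holds vacuously.
    """
    return all(
        local_versions.get(node) == version
        for node, version in answered_versions.items()
    )


def scenario4_result(local_versions, answered_versions, precalculated):
    """Return the stored aggregate when allowed, else None to signal that
    the Scenario 3 path (data requests plus on-the-fly aggregation) is needed."""
    if can_use_precalculated(local_versions, answered_versions):
        return precalculated
    return None
```

A single changed node is enough to invalidate the stored aggregate, which is why this scenario is most attractive for collections that rarely change.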

Country availability considerations. Availability of the Hub service: Hub availability aH to the end-user is guaranteed by a hosting service, e.g. at Platinum 99.99%, Gold 99.5%, Silver 99% and Bronze 98.5% level; it is not of concern for calculations on-the-fly. Country availability a to the Hub: a = aC × aN, where aC is the availability of hardware and software on the node (can be guaranteed by a hosting service) and aN is the availability of the network path between node and Hub (can be estimated using network statistics).
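As a quick worked example of a = aC × aN, the snippet below combines the four hosting levels quoted above with a network path availability of 90%. The 90% figure and the 28-node count are taken from elsewhere in the presentation; the function and variable names are illustrative only.

```python
def country_availability(a_c, a_n):
    """Availability of a country node as seen from the Hub: a = aC * aN."""
    return a_c * a_n


# Hosting levels as quoted in the slide (fractions instead of percentages).
HOSTING_LEVELS = {"Platinum": 0.9999, "Gold": 0.995, "Silver": 0.99, "Bronze": 0.985}

for level, a_c in HOSTING_LEVELS.items():
    a = country_availability(a_c, 0.90)
    # Expected number of unavailable nodes among 28, if all share the same a.
    print(f"{level}: a = {a:.4f}, expected unavailable nodes = {28 * (1 - a):.2f}")
```

With a network path availability well below the hosting level, aN dominates the product, so the choice of hosting package matters much less than the quality of the network path.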

Results for Scenario 1 1/2. The ICT aggregate is methodologically acceptable for nodes available > 55% (16+ for EU28), when the population of the nodes available is > 60%. Simulation results: out of 2^28 possible combinations, 84% are rejected. However, their total probability is small: ~10^-5 for country availability a=98.5% (Bronze); ~10^-6 for a=99% (Silver); ~10^-7 for a=99.5% (Gold); ~10^-12 for a=99.99% (Platinum hosting package).
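The difference between the share of rejected combinations and their total probability can be illustrated with a toy enumeration. The sketch below assumes that both criteria must hold for acceptance, that nodes fail independently with the same availability a, and it uses four nodes with made-up population shares; it does not reproduce the EU28 simulation or its figures.

```python
from itertools import product


def rejection_stats(pop_shares, a, min_node_share=0.55, min_pop_share=0.60):
    """Enumerate all up/down combinations of the nodes.

    Returns (fraction of combinations rejected, total probability of a
    rejected combination). A combination is accepted only if more than
    `min_node_share` of nodes and more than `min_pop_share` of the
    population are available (assumed reading of the slide's criteria).
    """
    n = len(pop_shares)
    total_pop = sum(pop_shares)
    rejected = 0
    p_reject = 0.0
    for up in product((False, True), repeat=n):
        k = sum(up)  # number of available nodes
        pop = sum(share for share, is_up in zip(pop_shares, up) if is_up)
        accepted = k / n > min_node_share and pop / total_pop > min_pop_share
        if not accepted:
            rejected += 1
            p_reject += a**k * (1 - a) ** (n - k)
    return rejected / 2**n, p_reject


# Hypothetical population shares for four toy nodes (not real data).
fraction, probability = rejection_stats([45, 25, 20, 10], a=0.9)
print(fraction, probability)
```

Even in this toy case most combinations are rejected while their combined probability stays far smaller, because the rejected combinations are mostly ones in which several nodes are down at once. A full 2^28 enumeration follows the same logic but would be slow in pure Python.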

Results for Scenario 1 2/2. Probability of rejection of the aggregate, by node availability level (aC) and network path availability (aN):

aC | aN=100% | aN=95% | aN=90% | aN=80%
Platinum | 10^-12 | 10^-4 | 0.6% | 6.7%
Gold | 10^-7 | | 0.7% | 7.2%
Silver | 10^-6 | 10^-3 | 0.8% | 7.7%
Bronze | 10^-5 | | 1.0% | 8.2%

Example: for country node availability at Bronze level (98.5%) and network availability of 90%, the probability of rejection of the aggregate is 1%. Indicatively, Bronze level is the lowest/cheapest availability level proposed by contractors to Eurostat for hosting. A 90% network path availability would mean that 1 out of 10 requests to country endpoints faces a network problem/congestion.

Performance for Scenarios 1 and 2. Scenario 1: in the best case (all nodes available) it takes tD, where tD = data retrieval time, with tD < tO (indicative timeout tO = 20 sec); in the worst case (1 or more nodes unavailable) it takes tO. Scenario 2: in the best case, it behaves as Scenario 1; in the worst case it takes tO+tL, for small tL = local data retrieval time.

Performance for Scenario 3 1/2. The version inquiry takes tV, where tV = version retrieval time, with tV < tU (indicative version timeout tU = 2 sec). If all nodes are found unchanged during the version inquiry: when all are available (best case), it takes tV+tL, for small tL = local data retrieval time; when 1 (or more) is unavailable, it takes tU+tL.

Performance for Scenario 3 2/2. If some nodes are found changed during the version inquiry, a data request is performed for the nodes found changed, which takes tD, where tD = data retrieval time, with tD < tO (indicative timeout tO = 20 sec). If all nodes are available during the data request, it takes tV+tD or tU+tD; if 1 (or more) is unavailable, it takes tV+tO+tL or tU+tO+tL (worst case).

Examples for Scenario 3. Best case example: at tV the version inquiry finds all nodes available and unchanged; at tV+tL the latest data versions are loaded from the local copy. Worst case example: at tU the version inquiry returns a timeout while it finds some of the available nodes changed; a data request is sent to the available nodes, and the latest data versions are loaded from the local copy for unavailable nodes; at tU+tO the data request returns a timeout; at tU+tO+tL the latest data versions are loaded from the local copy for unavailable nodes.

Expected performance results.

Scenario | Best case | Worst case | Often
Scenario 1 | tD+tA | tO+tA (tO>tD) | tO+tA
Scenario 2 | | tO+tL+tA (tL«tD) | tO+tL+tA
Scenario 3 | tV+tL+tA (tV«tD) | tU+tO+tL+tA (tU>tV) | tU+tD+tA
Scenario 4 | tV+tL | tU+tO+tL | tU+tD

tD = data retrieval time; tO = data request timeout; tV = version inquiry time; tU = version inquiry timeout; tL = local data retrieval time; tA = aggregations time.
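To get a feel for these expressions, the best-case and worst-case formulas can be evaluated for concrete timings. In the sketch below, tO = 20 s and tU = 2 s are the indicative timeouts from the earlier slides, while the values chosen for tD, tV, tL and tA are purely hypothetical. The best case for Scenario 2 is taken to equal that of Scenario 1, as stated in the slide on the performance of Scenarios 1 and 2.

```python
def expected_times(tD, tO, tV, tU, tL, tA):
    """Best-case and worst-case response times per scenario, in seconds.

    Symbols follow the presentation: tD data retrieval, tO data request
    timeout, tV version inquiry, tU version inquiry timeout, tL local
    data retrieval, tA aggregation.
    """
    return {
        "Scenario 1": {"best": tD + tA, "worst": tO + tA},
        "Scenario 2": {"best": tD + tA, "worst": tO + tL + tA},
        "Scenario 3": {"best": tV + tL + tA, "worst": tU + tO + tL + tA},
        "Scenario 4": {"best": tV + tL, "worst": tU + tO + tL},
    }


# tO and tU are the indicative timeouts; the other values are made up.
times = expected_times(tD=5.0, tO=20.0, tV=0.5, tU=2.0, tL=0.1, tA=1.0)
for scenario, t in times.items():
    print(f"{scenario}: best {t['best']:.1f} s, worst {t['worst']:.1f} s")
```

With any timings of this shape, the worst cases are dominated by the data request timeout tO, so the scenarios differ mainly in how often they have to wait for it.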

Preliminary conclusions 1/5. Scenario 1 pros: already implemented for the Census Hub; requires no further investment. Scenario 1 cons: high probability of (a few) unavailable nodes; aggregate calculated over available nodes only. Discussion: most probably there will always be sufficient nodes available to calculate the aggregate; the result will vary as nodes become available/unavailable.

Preliminary conclusions 2/5. Scenario 2 pros: aggregate calculated over all nodes; logic similar to Scenario 1. Scenario 2 cons: investment for periodic data pulling; central database to maintain; can be slower than Scenario 1. Discussion: aggregate calculation is always possible; the result will not (really) vary as nodes become available/unavailable.

Preliminary conclusions 3/5. Scenario 3 pros: aggregate calculated over all nodes; can be faster than Scenario 2 for frequent data pulls or large data sets (responses). Scenario 3 cons: investment for consistent data versioning; can be slower than Scenario 2 for non-frequent data pulls or small data sets (responses). Discussion: data versioning can be useful overall; interesting for large data sets.

Preliminary conclusions 4/5. Scenario 4 pros: aggregate calculated over all nodes; faster than Scenario 3. Scenario 4 cons: aggregates must be calculated after every data pull; useful only in the case of zero changed nodes. Discussion: interesting for old/static collections; depending on the computed aggregate (derivative quantity in general), aggregation on-the-fly can be complex.

Preliminary conclusions 5/5. For ICT test data: from 11 pilot countries, 9 endpoints were provided, with 7 being practically stable. Scenario 1: always found some unavailable nodes; always calculated a statistically acceptable aggregate. Scenario 2: ran faster than any other scenario.

Proposed improvements. Proposed conditions for more indicative results: more (ideally 28) endpoints available; production (versus test) endpoints. Higher endpoint availability: production-level infrastructure on country nodes; production-level infrastructure on the central node. Investment for extended functionality: data pulling, version checking, aggregations. Various statistical domains to be examined: for example, result sets in the Census domain are significantly larger and contain more dimensions.

The Enhanced Hub. Proposed as an extension to the Hub logic: to 'integrate' various online (hub) and offline (batch) functions; to 'synchronize' data versions in an ESS Data Warehouse; to 'serve' in parallel statistical domains with different needs.

Ideas behind the Enhanced Hub. Services are available online (for on-the-fly calls from Hub GUI(s)) and offline (for periodic/on-demand batch calls). Services examples: data pulling (pulls data from one location into another, exploiting a mapping tool, SDMX-RI); version checking (ensures that consistent data versions exist in different locations); data calculations (performs aggregations, derivations, validation); orchestration.

Conclusions. With the work package complete: all objectives of the work package were achieved; alternative scenarios were proposed for extending the Hub to support on-the-fly aggregations. Preliminary tests indicate: Scenario 2 as best suited for ICT data; Scenarios 1, 3 and 4 as better suited for other statistical domains. The Enhanced Hub could potentially meet the needs of any statistical domain.

Thank you for your attention! Contact: Ioannis.Xirouchakis@ec.europa.eu Unit B3, Eurostat