Ioannis Xirouchakis / Unit B3
Analysis of aggregations on-the-fly
Joint meeting of the ESS.VIP.BUS ICT Project and the ESS.VIP IT TF
Luxembourg, 3 March 2014
Completed activities
- Analysis of on-the-fly aggregation of statistics
- Design of scenarios and tests for extending the ICT Hub to support on-the-fly aggregations
- Testing of the scenarios
- Preliminary analysis of results
- Proposal
  - Outlining the Enhanced Hub (as an extension to the Census and ICT Hubs)
Aggregation on-the-fly
- Under the Hub logic
  - Each country node contains data relevant solely to that country
  - Aggregates would need to be computed centrally
- As data are not static
  - Countries can modify country data at any time
  - Aggregates would need to be computed on-the-fly, on an ad hoc basis, upon receipt of a relevant end-user query
An EU28 aggregate on the Hub
- Upon receipt of an end-user query
  - The Hub dispatches 28 requests to country nodes
  - Collects 28 (?) answers
  - Calculates the aggregate (derivative quantity)
  - Presents the result to the end user
- Challenges
  - Calculation depends on country node availability
  - Additional information may be required
  - The execution of complex calculations may be necessary
Scenario 1: Available countries only
- Algorithm - upon receipt of end-user request
  - The Hub sends N (e.g. 28) data requests
  - M responses are received within a given timeout
  - The Hub calculates the aggregate for M countries
- Main challenges
  - Are these M countries sufficient to provide an EU28 aggregate?
  - What if the next end-user request receives a different result (M* countries now available)?
  - Methodological treatment of a temporarily unavailable country node / result
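The Scenario 1 algorithm can be sketched as follows. This is only an illustration under stated assumptions: `query_node` is a hypothetical stand-in for a real data request, the node names and values are invented, and the aggregate is taken to be a plain sum.

```python
import asyncio

async def query_node(node, delay, value):
    """Hypothetical stand-in for a data request to one country node."""
    await asyncio.sleep(delay)
    return node, value

async def aggregate_available(nodes, timeout):
    """Send one request per node and sum the answers that arrive within `timeout`."""
    tasks = [asyncio.create_task(query_node(n, d, v)) for n, (d, v) in nodes.items()]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:          # nodes that did not answer in time are dropped
        task.cancel()
    answers = dict(task.result() for task in done)
    return sum(answers.values()), sorted(answers)

# Invented example: node N3 is too slow and misses the timeout.
nodes = {"N1": (0.0, 10.0), "N2": (0.0, 20.0), "N3": (1.0, 30.0)}
total, responded = asyncio.run(aggregate_available(nodes, timeout=0.2))
print(total, responded)
```

Here N = 3 requests are sent and M = 2 responses are used, so a later run with N3 available would return a different aggregate, which is exactly the challenge raised on the slide.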
Scenario 2: Periodic data pulling
- Algorithm - upon receipt of end-user request
  - The Hub receives M responses for N requests
  - Local copy utilized for unavailable countries
  - The Hub calculates the aggregate for all N countries
- Rationale
  - Aggregate calculated over N countries
- Functionality required vs. Scenario 1
  - Periodic offline data pulling
  - Preservation of the latest available data version centrally
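A minimal sketch of the Scenario 2 fallback, assuming the central copy is a simple mapping from node to its last pulled value and the aggregate is a plain sum. The function name and the data are hypothetical.

```python
def aggregate_with_fallback(responses, local_copy):
    """Sum over all N nodes, using the central copy where a node did not answer."""
    values = {}
    from_local = []
    for node, cached in local_copy.items():
        if node in responses:
            values[node] = responses[node]    # fresh answer received on-the-fly
        else:
            values[node] = cached             # latest version from the periodic pull
            from_local.append(node)
    return sum(values.values()), from_local

local_copy = {"N1": 10, "N2": 20, "N3": 30}   # invented result of the last periodic pull
responses = {"N1": 12, "N2": 20}              # N3 did not answer within the timeout
total, from_local = aggregate_with_fallback(responses, local_copy)
print(total, from_local)
```

The aggregate always covers all N nodes, at the cost of maintaining the central copy.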
Scenario 3: Data versioning 1/2
- Algorithm - upon receipt of end-user request
  - The Hub sends N (=28) data version inquiries
  - For the A (=20) inquiries answered, the Hub compares the data version with the local copy
  - For the S (=17) found with the same data version, the Hub uses the local copy
  - For all other P (=3), the Hub sends data requests and receives M (=2) responses
  - Local copy also utilized for unanswered data version inquiries and data requests (=8+1)
  - Local copy used for (N-A) + S + (P-M) = 8 + 17 + 1 = 26 nodes
Scenario 3: Data versioning 2/2
- Rationale
  - As few data as possible pulled on-the-fly
  - No data pulled if they exist already
- Functionality required vs. Scenario 2
  - Consistent data versioning
  - Version inquiry
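The node counts of the Scenario 3 example (N=28, A=20, S=17, P=3, M=2) can be reproduced with a short sketch. The function, the node names and the version numbers are hypothetical; only the counts come from the slides.

```python
def choose_sources(local_versions, remote_versions, fresh_data):
    """Split nodes into those served from the local copy and those served fresh.

    local_versions:  node -> version held centrally (all N nodes)
    remote_versions: node -> version reported, only for answered inquiries (A nodes)
    fresh_data:      nodes whose follow-up data request succeeded (M nodes)
    """
    local, fresh = [], []
    for node, held in local_versions.items():
        reported = remote_versions.get(node)
        changed = reported is not None and reported != held
        if changed and node in fresh_data:
            fresh.append(node)
        else:
            local.append(node)    # unchanged, unanswered inquiry, or failed data request
    return local, fresh

names = [f"N{i:02d}" for i in range(28)]                      # N = 28
local_versions = {n: 1 for n in names}
remote_versions = {n: (1 if i < 17 else 2)                    # S = 17 same, P = 3 changed
                   for i, n in enumerate(names[:20])}         # A = 20 answered
fresh_data = {"N17", "N18"}                                   # M = 2 data responses
local, fresh = choose_sources(local_versions, remote_versions, fresh_data)
print(len(local), len(fresh))
```

The local copy serves 26 nodes, matching (N-A) + S + (P-M) = 8 + 17 + 1.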
Scenario 4: Pre-calculated aggregate
- Extension to the algorithm of Scenario 3
  - When, for all data version inquiries answered (A), all data versions are up-to-date (S=A)
  - Then no data requests need to be sent
  - And a pre-calculated aggregate could be used
- Rationale
  - No complex calculations on-the-fly
- Functionality required
  - Offline calculations after each data pull
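The S=A condition of Scenario 4 amounts to a single check over the answered version inquiries. The sketch below is illustrative only; the function name and the version data are assumptions.

```python
def can_use_precalculated(local_versions, remote_versions):
    """True when every answered inquiry reports the version already held centrally (S == A)."""
    return all(local_versions.get(node) == version
               for node, version in remote_versions.items())

local_versions = {"N1": 3, "N2": 5, "N3": 2}    # invented versions held centrally
unchanged = {"N1": 3, "N2": 5}                  # N3 did not answer; answered nodes match
one_changed = {"N1": 3, "N2": 6}                # N2 reports a newer version

print(can_use_precalculated(local_versions, unchanged))
print(can_use_precalculated(local_versions, one_changed))
```

When the check fails, the Hub falls back to the Scenario 3 algorithm and recomputes the aggregate on-the-fly.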
Country availability considerations
- Availability of Hub service
  - Hub availability aH to the end-user
  - Guaranteed by a hosting service, e.g. at Platinum 99.99%, Gold 99.5%, Silver 99% and Bronze 98.5% level
  - Not of concern for calculations on-the-fly
- Country availability a to the Hub
  - a = aC × aN
  - aC: Availability of hardware & software on the node. Can be guaranteed by a hosting service.
  - aN: Availability of network path between node and Hub. Can be estimated using network statistics.
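As a worked example of a = aC × aN, the sketch below combines the hosting levels quoted above with a network path availability. Multiplying the two treats node and network failures as independent, which is an assumption of this illustration; the function and dictionary names are hypothetical.

```python
# Hosting availability levels as quoted on the slide.
HOSTING = {"Platinum": 0.9999, "Gold": 0.995, "Silver": 0.99, "Bronze": 0.985}

def country_availability(a_c, a_n):
    """Availability of a country node as seen from the Hub: a = aC x aN."""
    return a_c * a_n

# Bronze hosting combined with a 90% available network path.
a = country_availability(HOSTING["Bronze"], 0.90)
print(round(a, 4))
```

A Bronze node behind a 90% network path is therefore reachable from the Hub in roughly 89% of requests.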
Results for Scenario 1 1/2
- ICT aggregate methodologically acceptable
  - When more than 55% of nodes are available (16 or more for EU28)
  - And the available nodes cover more than 60% of the population
- Simulation results
  - Out of 2^28 possible combinations of available/unavailable nodes, 84% are rejected
  - However, their total probability is small
    - ~10^-5 for country availability a=98.5% (Bronze)
    - ~10^-6 for a=99% (Silver)
    - ~10^-7 for a=99.5% (Gold)
    - ~10^-12 for a=99.99% (Platinum hosting package)
Results for Scenario 1 2/2
Probability of rejection of the aggregate, by node availability aC (rows) and network availability aN (columns)
aC       | aN=100% | aN=95% | aN=90% | aN=80%
Platinum | 10^-12  | 10^-4  | 0.6%   | 6.7%
Gold     | 10^-7   |        | 0.7%   | 7.2%
Silver   | 10^-6   | 10^-3  | 0.8%   | 7.7%
Bronze   | 10^-5   |        | 1.0%   | 8.2%
- Example
  - For country node availability at Bronze level (98.5%) and network availability of 90%, the probability of rejection of the aggregate is 1%
  - Indicatively, Bronze level is the lowest/cheapest availability level proposed by contractors to Eurostat for hosting
  - A 90% network path availability would mean that 1 out of 10 requests to country endpoints faces a network problem/congestion
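The rejection probabilities can be understood as the total probability of all availability combinations that fail the acceptance rule. The sketch below enumerates those combinations for a toy set of five nodes. The population shares are invented for illustration and are not the EU28 figures, the acceptance rule assumes both thresholds (more than 55% of nodes, more than 60% of population) must hold, and every node is assumed to fail independently with the same availability a.

```python
from itertools import product

def rejection_probability(shares, a, min_nodes=0.55, min_pop=0.60):
    """Total probability of the node combinations for which the aggregate is rejected."""
    n = len(shares)
    rejected = 0.0
    for up in product((True, False), repeat=n):      # every available/unavailable pattern
        k = sum(up)                                  # number of available nodes
        pop = sum(s for s, u in zip(shares, up) if u)
        accepted = k / n > min_nodes and pop > min_pop
        if not accepted:
            rejected += a ** k * (1 - a) ** (n - k)  # probability of this pattern
    return rejected

shares = [0.35, 0.25, 0.20, 0.10, 0.10]              # hypothetical population shares
for a in (0.985, 0.8865):
    print(a, rejection_probability(shares, a))
```

The same enumeration over 28 nodes covers 2^28 patterns, which is feasible but slow in pure Python; the point of the sketch is only that the rejected patterns carry little probability when a is high and much more once network availability lowers a.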
Performance for Scenarios 1 and 2
- Scenario 1
  - In the best case (all nodes available), it takes tD
    - tD = data retrieval time, with tD < tO
    - Indicative timeout tO = 20 sec
  - In the worst case (1 or more nodes unavailable), it takes tO
- Scenario 2
  - In the best case, it is Scenario 1
  - Otherwise it takes tO+tL, for small tL = local data retrieval time
Performance for Scenario 3 1/2
- The version inquiry takes tV
  - tV = version retrieval time, with tV < tU
  - Indicative version timeout tU = 2 sec
- If all nodes are found unchanged during the version inquiry
  - And all are available (best case), it takes tV+tL, for small tL = local data retrieval time
  - If 1 (or more) is unavailable, it takes tU+tL
Performance for Scenario 3 2/2
- If some nodes are found changed during the version inquiry, a data request is performed, which takes tD
  - tD = data retrieval time, with tD < tO
  - Indicative timeout tO = 20 sec
- For the nodes found changed
  - If all nodes are available during the data request, it takes tV+tD or tU+tD
  - If 1 (or more) is unavailable, it takes tV+tO+tL or tU+tO+tL (worst case)
Examples for Scenario 3
- Best case example
  - At tV the version inquiry finds all nodes available and unchanged
  - At tV+tL latest data versions loaded from local copy
- Worst case example
  - At tU the version inquiry returns a timeout while it finds some of the available nodes changed
  - Data request sent to available nodes / Latest data versions loaded from local copy for unavailable nodes
  - At tU+tO the data request returns a timeout
  - At tU+tO+tL latest data versions loaded from local copy for unavailable nodes
Expected performance results
           | Best case              | Worst case              | Often
Scenario 1 | tD+tA                  | tO+tA (tO>tD)           | tO+tA
Scenario 2 | tD+tA (as Scenario 1)  | tO+tL+tA (tL«tD)        | tO+tL+tA
Scenario 3 | tV+tL+tA (tV«tD)       | tU+tO+tL+tA (tU>tV)     | tU+tD+tA
Scenario 4 | tV+tL                  | tU+tO+tL                | tU+tD
- tD = data retrieval time
- tO = data request timeout
- tV = version inquiry time
- tU = version inquiry timeout
- tL = local data retrieval time
- tA = aggregation time
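To get a feel for the magnitudes, the best-case and worst-case expressions can be evaluated numerically. In the sketch below, tO = 20 sec and tU = 2 sec are the indicative timeouts from the slides; the values of tD, tV, tL and tA are invented for illustration, and the Scenario 2 best case is taken to equal Scenario 1 as stated on the performance slide.

```python
def expected_times(t_d, t_o, t_v, t_u, t_l, t_a):
    """Best-case and worst-case response times per scenario, in seconds."""
    return {
        "Scenario 1": {"best": t_d + t_a, "worst": t_o + t_a},
        "Scenario 2": {"best": t_d + t_a, "worst": t_o + t_l + t_a},
        "Scenario 3": {"best": t_v + t_l + t_a, "worst": t_u + t_o + t_l + t_a},
        "Scenario 4": {"best": t_v + t_l, "worst": t_u + t_o + t_l},
    }

# t_o and t_u are the indicative timeouts; the other four values are hypothetical.
times = expected_times(t_d=5, t_o=20, t_v=0.5, t_u=2, t_l=0.25, t_a=0.5)
for scenario, t in times.items():
    print(scenario, t["best"], t["worst"])
```

With these assumed inputs the worst cases are dominated by the 20 sec data request timeout, while the best cases of Scenarios 3 and 4 depend only on the short version inquiry and local retrieval.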
Preliminary conclusions 1/5
- Scenario 1 pros
  - Already implemented for the Census Hub
  - Requires no further investment
- Scenario 1 cons
  - High probability of (a few) unavailable nodes
  - Aggregate calculated over available nodes only
- Discussion
  - Most probably there will always be sufficient nodes available to calculate the aggregate
  - The result will vary as nodes become available/unavailable
Preliminary conclusions 2/5
- Scenario 2 pros
  - Aggregate calculated over all nodes
  - Logic similar to Scenario 1
- Scenario 2 cons
  - Investment for periodic data pulling
  - Central database to maintain
  - Can be slower than Scenario 1
- Discussion
  - Aggregate calculation always possible
  - The result will not (really) vary as nodes become available/unavailable
Preliminary conclusions 3/5
- Scenario 3 pros
  - Aggregate calculated over all nodes
  - Can be faster than Scenario 2 for frequent data pulls or large data sets (responses)
- Scenario 3 cons
  - Investment for consistent data versioning
  - Can be slower than Scenario 2 for non-frequent data pulls or small data sets (responses)
- Discussion
  - Data versioning can be useful overall
  - Interesting for large data sets
Preliminary conclusions 4/5
- Scenario 4 pros
  - Aggregate calculated over all nodes
  - Faster than Scenario 3
- Scenario 4 cons
  - Aggregates calculation after every data pull
  - Useful only in the case of zero changed nodes
- Discussion
  - Interesting for old/static collections
  - Depending on the computed aggregate (derivative quantity in general), aggregation on-the-fly can be complex
Preliminary conclusions 5/5
- For ICT test data
  - Of 11 pilot countries, 9 provided endpoints, 7 of which were practically stable
- Scenario 1
  - Always found some nodes unavailable
  - Always calculated a statistically acceptable aggregate
- Scenario 2
  - Ran faster than any other scenario
Proposed improvements
- Proposed conditions for more indicative results
  - More (ideally 28) endpoints available
  - Production (versus test) endpoints
- Higher endpoint availability
  - Production-level infrastructure on country nodes
  - Production-level infrastructure on central node
- Investment for extended functionality
  - Data pulling, version checking, aggregations
- Various statistical domains to be examined
  - For example, result sets in the Census domain are significantly larger and contain more dimensions
The Enhanced Hub
- Proposed as an extension to the Hub logic
  - to 'integrate' various online (hub) and offline (batch) functions
  - to 'synchronize' data versions in an ESS Data Warehouse
  - to 'serve' in parallel statistical domains with different needs
Ideas behind the Enhanced Hub
- Services are available
  - Online: for on-the-fly calls from Hub GUI(s)
  - Offline: for periodic/on-demand batch calls
- Services examples
  - Data pulling: pulls data from one location into another, exploiting a mapping tool (SDMX-RI)
  - Version checking: ensures that consistent data versions exist in different locations
  - Data calculations: performs aggregations, derivations, validation
  - Orchestration
Conclusions
- With the work package complete
  - All objectives of the work package were achieved
  - Alternative scenarios proposed for extending the Hub to support on-the-fly aggregations
- Preliminary tests indicate
  - Scenario 2 as best suited for ICT data
  - Scenarios 1, 3 and 4 as better suited for other statistical domains
- The Enhanced Hub could potentially meet the needs of any statistical domain
Thank you for your attention! Contact: Ioannis.Xirouchakis@ec.europa.eu Unit B3, Eurostat