Ioannis Xirouchakis / Unit B3

Analysis of aggregations on-the-fly
Joint meeting of the ESS.VIP.BUS ICT Project and the ESS.VIP IT TF
Luxembourg, 3 March 2014
Ioannis Xirouchakis / Unit B3

Completed activities
Analysis of on-the-fly aggregation of statistics
  Design of scenarios and tests for extending the ICT Hub to support on-the-fly aggregations
  Testing of the scenarios
  Preliminary analysis of results
Proposal
  Outlining the Enhanced Hub (as an extension to the Census and ICT Hubs)

Aggregation on-the-fly
Under the Hub logic
  Each country node contains data relevant solely to that country
  Aggregates would need to be computed centrally
As data are not static
  Countries can modify country data at any time
  Aggregates would need to be computed on-the-fly, on an ad hoc basis, upon receipt of a relevant end-user query

An EU28 aggregate on the Hub
Upon receipt of an end-user query
  The Hub dispatches 28 requests to country nodes
  Collects 28 (?) answers
  Calculates the aggregate (derivative quantity)
  Presents the result to the end user
Challenges
  Calculation depends on country node availability
  Additional information may be required
  The execution of complex calculations may be necessary

Scenario 1: Available countries only
Algorithm - upon receipt of end-user request
  The Hub sends N (e.g. 28) data requests
  M responses are received within a given timeout
  The Hub calculates the aggregate for M countries
Main challenges
  Are these M countries sufficient to provide an EU28 aggregate?
  What if the next end-user request receives a different result (M* countries now available)?
  Methodological treatment of a temporarily unavailable country node / result
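
The Scenario 1 algorithm can be sketched as a scatter-gather with a shared timeout. This is a minimal illustration under stated assumptions, not the Hub's implementation: the function names, the node labels, the simulated delays and the use of a plain sum as the aggregate are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def aggregate_available(nodes, fetch_node, timeout_s):
    """Send one request per node; aggregate only the answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(nodes)))
    futures = {pool.submit(fetch_node, node): node for node in nodes}
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on nodes that missed the timeout
    values = {}
    for fut in done:
        if fut.exception() is None:  # a failing node counts as unavailable
            values[futures[fut]] = fut.result()
    # The aggregate covers the M answering nodes only (assumed here to be a sum).
    return sum(values.values()), sorted(values)


def fake_fetch(node):
    # Hypothetical stand-in for a country-node request; "SLOW" misses the timeout.
    delays = {"AA": 0.0, "BB": 0.0, "SLOW": 0.6}
    time.sleep(delays[node])
    return {"AA": 10, "BB": 20, "SLOW": 30}[node]


total, answered = aggregate_available(["AA", "BB", "SLOW"], fake_fetch, timeout_s=0.2)
print(total, answered)
```

Rerunning the same request while "SLOW" is reachable would return a different total, which is the variability raised under the main challenges.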

Scenario 2: Periodic data pulling
Algorithm - upon receipt of end-user request
  The Hub receives M responses for N requests
  Local copy utilized for unavailable countries
  The Hub calculates the aggregate for all N countries
Rationale
  Aggregate calculated over N countries
Functionality required vs. Scenario 1
  Periodic offline data pulling
  Preservation of latest available data version centrally
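
The fallback to the local copy can be sketched as follows. The function names, the dictionary-based storage and the sum aggregate are assumptions for illustration; the slides do not specify how the central copy is stored.

```python
def aggregate_with_local_copy(nodes, live_responses, local_copy):
    """Use the live answer where one arrived, else the latest centrally stored value."""
    values, from_copy = {}, []
    for node in nodes:
        if node in live_responses:
            values[node] = live_responses[node]
        else:
            # A KeyError here means the node was never pulled: fail loudly.
            values[node] = local_copy[node]
            from_copy.append(node)
    return sum(values.values()), from_copy


def periodic_pull(live_responses, local_copy):
    """Offline step: refresh the central copy with whatever the nodes returned."""
    local_copy.update(live_responses)
    return local_copy


# Hypothetical data: node "CC" did not answer, so its stored value is used.
local = {"AA": 9, "BB": 20, "CC": 30}
total, used_copy = aggregate_with_local_copy(
    ["AA", "BB", "CC"], {"AA": 10, "BB": 20}, local
)
print(total, used_copy)
```

The aggregate always covers all N nodes, at the cost of possibly mixing fresh values with older stored ones.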

Scenario 3: Data versioning 1/2
Algorithm - upon receipt of end-user request
  The Hub sends N (=28) data version inquiries
  For the A (=20) inquiries answered, the Hub compares the data version with the local copy
  For the S (=17) found with the same data version, the Hub uses the local copy
  For all other P (=3), the Hub sends data requests and receives M (=2) responses
  Local copy also utilized for unanswered data version inquiries and data requests (=8+1)
  Local copy used for (N-A)+S+(P-M) = 8+17+1 = 26 nodes
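
The bookkeeping in this example can be checked with a short sketch. The function name, the node labels and the integer version numbers are hypothetical; only the counts N=28, A=20, S=17, P=3 and M=2 come from the slide.

```python
def plan_sources(nodes, remote_versions, local_versions, data_responses):
    """Decide per node whether the aggregate uses the local copy or fresh data.

    remote_versions: versions returned by nodes that answered the inquiry (A nodes)
    data_responses:  nodes that answered the follow-up data request (M nodes)
    """
    local, fresh = [], []
    for node in nodes:
        if node not in remote_versions:  # version inquiry unanswered (N - A)
            local.append(node)
        elif remote_versions[node] == local_versions[node]:  # unchanged (S)
            local.append(node)
        elif node in data_responses:  # changed and data received (M)
            fresh.append(node)
        else:  # changed, but the data request failed (P - M)
            local.append(node)
    return local, fresh


nodes = [f"node{i:02d}" for i in range(28)]          # N = 28
local_versions = {n: 1 for n in nodes}
answered = nodes[:20]                                 # A = 20
remote_versions = {n: 1 for n in answered[:17]}       # S = 17 unchanged
remote_versions.update({n: 2 for n in answered[17:]}) # P = 3 changed
data_responses = set(answered[17:19])                 # M = 2 data responses

local, fresh = plan_sources(nodes, remote_versions, local_versions, data_responses)
print(len(local), len(fresh))
```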

Scenario 3: Data versioning 2/2
Rationale
  As few data as possible pulled on-the-fly
  No data pulled if they exist already
Functionality required vs. Scenario 2
  Consistent data versioning
  Version inquiry

Scenario 4: Pre-calculated aggregate
Extension to algorithm of Scenario 3
  When, for all data version inquiries answered (A), all data versions are up-to-date (S=A)
  Then no data requests need to be sent
  And a pre-calculated aggregate could be used
Rationale
  No complex calculations on-the-fly
Functionality required
  Offline calculations after each data pull
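
The S=A shortcut can be sketched as a single check before any data request is sent. The function name, the `recompute` callback and the sample values are hypothetical and only illustrate the decision.

```python
def aggregate_scenario4(remote_versions, local_versions, precalculated, recompute):
    """Return the stored aggregate when every answered node is unchanged (S == A)."""
    all_unchanged = all(
        local_versions.get(node) == version
        for node, version in remote_versions.items()
    )
    if all_unchanged:
        # No data requests and no on-the-fly calculation are needed.
        return precalculated, "pre-calculated"
    # At least one node changed: fall back to the Scenario 3 path.
    return recompute(), "recomputed"


local_versions = {"AA": 1, "BB": 1, "CC": 1}
unchanged = aggregate_scenario4({"AA": 1, "BB": 1}, local_versions, 100, lambda: 105)
changed = aggregate_scenario4({"AA": 2, "BB": 1}, local_versions, 100, lambda: 105)
print(unchanged, changed)
```

Nodes that did not answer the version inquiry are simply absent from `remote_versions`, so they do not block the shortcut; this matches the slide's condition, which refers only to the inquiries answered.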

Country availability considerations
Availability of Hub service
  Hub availability aH to the end user
  Guaranteed by a hosting service, e.g. at Platinum 99.99%, Gold 99.5%, Silver 99% and Bronze 98.5% level
  Not of concern for calculations on-the-fly
Country availability a to the Hub
  a = aC × aN
  aC: availability of hardware & software on the node. Can be guaranteed by a hosting service.
  aN: availability of the network path between node and Hub. Can be estimated using network statistics.
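
As a worked example of a = aC × aN, the sketch below combines a Bronze-level node with a 90% network path. The product form is read here as treating node and network failures as independent, which is an assumption; the function and variable names are hypothetical.

```python
def country_availability(a_c, a_n):
    """a = aC x aN, assuming node and network failures are independent."""
    return a_c * a_n


# Hosting levels quoted on the slide.
HOSTING_LEVELS = {"Platinum": 0.9999, "Gold": 0.995, "Silver": 0.99, "Bronze": 0.985}

a = country_availability(HOSTING_LEVELS["Bronze"], 0.90)
expected_unavailable = 28 * (1 - a)  # mean number of unreachable nodes out of 28
print(round(a, 4), round(expected_unavailable, 2))
```

Under these figures each country is reachable for roughly 89% of requests, so on average about three of 28 nodes would be missing from any single on-the-fly request.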

Results for Scenario 1 1/2
ICT aggregate methodologically acceptable
  For nodes available > 55% (16+ for EU28)
  When population of nodes available > 60%
Simulation results
  Out of 2^28 possible combinations, 84% rejected
  However, their total probability is small
    ~10^-5 for country availability a=98.5% (Bronze)
    ~10^-6 for a=99% (Silver)
    ~10^-7 for a=99.5% (Gold)
    ~ for a=99.99% (Platinum hosting package)

Results for Scenario 1 2/2
Probability of rejection of the aggregate
aC         aN=100%   aN=95%   aN=90%   aN=80%
Platinum   10^-12    10^-4    0.6%     6.7%
Gold       10^-7              0.7%     7.2%
Silver     10^-6     10^-3    0.8%     7.7%
Bronze     10^-5              1.0%     8.2%
Example
  For country node availability at Bronze level (98.5%) and network availability of 90%, the probability of rejection of the aggregate is 1%
  Indicatively, Bronze level is the lowest/cheapest availability level proposed by contractors to Eurostat for hosting
  A 90% network path availability would mean that 1 out of 10 requests to country endpoints faces a network problem/congestion
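
A rejection probability of this kind can be obtained by enumerating every availability pattern and summing the probability of those that fail the acceptance criteria. The sketch below is not the simulation used for the slides: it assumes equal, independent node availability, reads the two criteria as both being required, and uses four hypothetical population shares instead of the real EU28 figures.

```python
from itertools import product


def rejection_probability(pop_shares, a, min_node_share=0.55, min_pop_share=0.60):
    """Sum the probability of every availability pattern that fails a criterion.

    pop_shares: each node's share of the total population (should sum to 1)
    a:          availability of each node, assumed equal and independent
    """
    n = len(pop_shares)
    rejected = 0.0
    for pattern in product((True, False), repeat=n):
        up = sum(pattern)
        pop_up = sum(share for share, ok in zip(pop_shares, pattern) if ok)
        acceptable = up / n > min_node_share and pop_up > min_pop_share
        if not acceptable:
            rejected += a ** up * (1 - a) ** (n - up)
    return rejected


# Hypothetical shares for four nodes; a = 0.8865 is Bronze (98.5%) x 90% network.
p = rejection_probability([0.4, 0.3, 0.2, 0.1], a=0.8865)
print(p)
```

For 28 nodes the same loop would visit 2^28 patterns, so a practical version would need the real population shares and a faster enumeration or sampling strategy.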

Performance for Scenarios 1 and 2
Scenario 1
  In the best case (all nodes available)
    It takes tD, where tD = data retrieval time
    With tD < tO, indicative timeout tO = 20 sec
  In the worst case (1 or more nodes unavailable)
    It takes tO
Scenario 2
  In the best case, it is Scenario 1
  Otherwise it takes tO+tL, for small tL = local data retrieval time

Performance for Scenario 3 1/2
The version inquiry
  Takes tV, where tV = version retrieval time
  With tV < tU, indicative version timeout tU = 2 sec
If all nodes found unchanged during version inquiry
  And all available (best case)
    It takes tV+tL, for small tL = local data retrieval time
  If 1 (or more) unavailable
    It takes tU+tL

Performance for Scenario 3 2/2
If some nodes found changed during version inquiry, a data request is performed which
  Takes tD, where tD = data retrieval time
  With tD < tO, indicative timeout tO = 20 sec
For the nodes found changed
  If all nodes available during the data request
    It takes tV+tD OR tU+tD
  If 1 (or more) unavailable
    It takes tV+tO+tL OR tU+tO+tL (worst case)

Examples for Scenario 3
Best case example
  At tV the version inquiry finds all nodes available and unchanged
  At tV+tL latest data versions loaded from local copy
Worst case example
  At tU the version inquiry returns a timeout while it finds some of the available nodes changed
  Data request sent to available nodes / Latest data versions loaded from local copy for unavailable nodes
  At tU+tO the data request returns a timeout
  At tU+tO+tL latest data versions loaded from local copy for unavailable nodes

Expected performance results
             Best case            Worst case               Often
Scenario 1   tD+tA                tO+tA (tO>tD)            tO+tA
Scenario 2   as Scenario 1        tO+tL+tA (tL«tD)         tO+tL+tA
Scenario 3   tV+tL+tA (tV«tD)     tU+tO+tL+tA (tU>tV)      tU+tD+tA
Scenario 4   tV+tL                tU+tO+tL                 tU+tD
tD = data retrieval time
tO = data request timeout
tV = version inquiry time
tU = version inquiry timeout
tL = local data retrieval time
tA = aggregations time
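
To compare the scenarios numerically, the expressions above can be evaluated for a chosen set of timings. In this sketch only tO = 20 sec and tU = 2 sec are the indicative values from the slides; the other four timings are hypothetical, and the Scenario 2 best case is taken to equal Scenario 1, following the statement that in the best case it is Scenario 1.

```python
def expected_times(tD, tO, tV, tU, tL, tA):
    """Return {scenario: (best, worst, often)} in seconds."""
    return {
        "Scenario 1": (tD + tA, tO + tA, tO + tA),
        # Best case assumed identical to Scenario 1 (no local copy needed).
        "Scenario 2": (tD + tA, tO + tL + tA, tO + tL + tA),
        "Scenario 3": (tV + tL + tA, tU + tO + tL + tA, tU + tD + tA),
        # Scenario 4 uses a pre-calculated aggregate, so tA does not appear.
        "Scenario 4": (tV + tL, tU + tO + tL, tU + tD),
    }


# tO and tU are the indicative timeouts; tD, tV, tL, tA are hypothetical.
times = expected_times(tD=5.0, tO=20.0, tV=0.5, tU=2.0, tL=0.1, tA=0.2)
for scenario, (best, worst, often) in times.items():
    print(f"{scenario}: best={best:.1f}s worst={worst:.1f}s often={often:.1f}s")
```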

Preliminary conclusions 1/5
Scenario 1 pros
  Already implemented for the Census Hub
  Requires no further investment
Scenario 1 cons
  High probability of (a few) unavailable nodes
  Aggregate calculated over available nodes only
Discussion
  Most probably there will always be sufficient nodes available to calculate the aggregate
  The result will vary as nodes become available/unavailable

Preliminary conclusions 2/5
Scenario 2 pros
  Aggregate calculated over all nodes
  Logic similar to Scenario 1
Scenario 2 cons
  Investment for periodic data pulling
  Central database to maintain
  Can be slower than Scenario 1
Discussion
  Aggregate calculation always possible
  The result will not (really) vary as nodes become available/unavailable

Preliminary conclusions 3/5
Scenario 3 pros
  Aggregate calculated over all nodes
  Can be faster than Scenario 2 for frequent data pulls or large data sets (responses)
Scenario 3 cons
  Investment for consistent data versioning
  Can be slower than Scenario 2 for non-frequent data pulls or small data sets (responses)
Discussion
  Data versioning can be useful overall
  Interesting for large data sets

Preliminary conclusions 4/5
Scenario 4 pros
  Aggregate calculated over all nodes
  Faster than Scenario 3
Scenario 4 cons
  Aggregates must be calculated after every data pull
  Useful only in the case of zero changed nodes
Discussion
  Interesting for old/static collections
  Depending on the computed aggregate (derivative quantity in general), aggregation on-the-fly can be complex

Preliminary conclusions 5/5
For ICT test data
  Of 11 pilot countries, 9 provided endpoints, 7 of which were practically stable
Scenario 1
  Always found some unavailable nodes
  Always calculated a statistically acceptable aggregate
Scenario 2
  Ran faster than any other scenario

Proposed improvements
Proposed conditions for more indicative results
  More (ideally 28) endpoints available
  Production (versus test) endpoints
  Higher endpoint availability
    Production-level infrastructure on country nodes
    Production-level infrastructure on central node
  Investment for extended functionality
    Data pulling, version checking, aggregations
  Various statistical domains to be examined
    For example, result sets in the Census domain are significantly larger and contain more dimensions

The Enhanced Hub
Proposed as an extension to the Hub logic
  to 'integrate' various online (hub) and offline (batch) functions
  to 'synchronize' data versions in an ESS Data Warehouse
  to 'serve' in parallel statistical domains with different needs

Ideas behind the Enhanced Hub
Services are available
  Online: for on-the-fly calls from Hub GUI(s)
  Offline: for periodic/on-demand batch calls
Services examples
  Data pulling: pulls data from one location into another exploiting a mapping tool (SDMX-RI)
  Version checking: ensures that consistent data versions exist in different locations
  Data calculations: performs aggregations, derivations, validation
  Orchestration

Conclusions
With the work package complete
  All objectives of the work package were achieved
  Alternative scenarios proposed for extending the Hub to support on-the-fly aggregations
Preliminary tests indicate
  Scenario 2 as best suited for ICT data
  Scenarios 1, 3 and 4 better suited for other statistical domains
The Enhanced Hub could potentially meet the needs of any statistical domain

Thank you for your attention!
Contact: Unit B3, Eurostat
