HDB@ELK: another noSql customization for the HDB++ archiving system
M. Di Carlo* (a), M. Canzari (a), M. Dolci (a), R. Smareglia (b)
(a) INAF Osservatorio Astronomico d’Abruzzo, Teramo, Italy; (b) INAF Osservatorio Astronomico di Trieste, Trieste, Italy
Introduction
- Study how to extend HDB++
- Study the archiving in Elasticsearch
- Use Kibana to visualise the archived data
HDB++ Module View
[Diagram: the Event Subscriber and the Configuration Manager use the Database Abstraction Layer (a C++ module); the MySql and Cassandra modules inherit from it. Key: uses, inherit, C++ module.]
HDB++ Runtime View (from the tango docs camp)
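HDB++ Data Model
[Diagram: entities AttributeConfigurationHistory, AttributeParameter, AttributeEventData and Value, linked by associations with multiplicities 1 and *; Value is specialised by type (Double, Long, String, ...). Key: association, inherit, Table.]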
Elasticsearch
- Real-time distributed search and analytics engine
- “Real-time” refers to the ability to search (and sometimes create) data as soon as they are produced
- Distributed because its indices are divided into shards, each with zero or more replicas
- Analytics engine because it allows the discovery, interpretation, and communication of meaningful patterns in data
- Based on Apache Lucene, a free and open-source information retrieval software library
- Developed alongside Logstash, a data-collection and log-parsing engine, and Kibana, an analytics and visualization platform
Elasticsearch main features
- No transactions: transactions are not supported
- Flexible schema: there is no need to specify the schema upfront
- Relations: handled through denormalization, parent-child relations and nested objects
- Robustness: to work properly, Elasticsearch requires abundant memory
- Distributed: it is a CP system in terms of the CAP (Consistency, Availability, Partition tolerance) theorem
Elasticsearch and relations
- Everything is flat: every document is independent, so every document should contain all of the information required to decide whether it matches a query
- This helps indexing, searching and scalability, since documents can be spread across multiple nodes
- Relations are not managed in the same way as in an RDBMS
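To make the "everything is flat" point concrete, the following minimal sketch builds one self-contained document for a single archived event. The field names, the device name and the value are hypothetical and chosen only for illustration; they are not the schema used by HDB@ELK.

```python
import json
from datetime import datetime, timezone


def make_event_document(device, attribute, data_type, value, event_time):
    """Build one flat document that carries everything needed to match a query."""
    return {
        # Hypothetical field names, for illustration only.
        "device": device,
        "attribute": attribute,
        "data_type": data_type,
        "event_time": event_time.isoformat(),
        "value": value,
    }


doc = make_event_document(
    device="sys/tg_test/1",
    attribute="double_scalar",
    data_type="double",
    value=42.5,
    event_time=datetime(2017, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
)
print(json.dumps(doc, indent=2))
```

Because the device and attribute names are repeated in every document, a query on either one needs no lookup in a second table, at the price of redundant storage.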
Implementation
- The selected development language was C++
- The new library had to be able to work with REST and with JSON data
- “REST client for C++”: https://github.com/mrtazz/restclient-cpp
- “Json for modern C++”: https://github.com/nlohmann/json
- Implementing the “AbstractDB” took around 4 weeks in total
- Testing and studying took around two months
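The REST and JSON requirement can be illustrated with a short sketch. It is written in Python with the standard library only, not with the C++ libraries listed above. It assumes a local Elasticsearch node at `http://localhost:9200`, a hypothetical index called `hdb-events`, and the `POST /<index>/_doc` endpoint of recent Elasticsearch versions (older versions expect a type name in the path). The request is only built, not sent.

```python
import json
import urllib.request


def build_index_request(base_url, index, document):
    """Build (but do not send) an HTTP request that would index one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical node address, index name and document.
request = build_index_request(
    "http://localhost:9200",
    "hdb-events",
    {"attribute": "double_scalar", "value": 42.5},
)
print(request.get_method(), request.full_url)
```

Sending it would be a call to `urllib.request.urlopen(request)`, which needs a running server and is therefore left out here.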
Implementation: class diagram
To add a new entity to the DB:
- Add an entity (DBEntity) that represents the information to store, with the four main operations;
- Implement the Get and Save operations in the DAL.
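The two steps can be sketched as follows, in Python rather than C++ and with an in-memory dictionary standing in for the REST back end. The class names `AttributeEvent` and `InMemoryDal` and all field names are hypothetical. The slides do not name the four main operations of DBEntity, so they are not modelled here; only conversion to and from a document, plus Get and Save, is shown.

```python
from dataclasses import asdict, dataclass


@dataclass
class AttributeEvent:
    """Hypothetical entity: the information to store for one archived event."""
    attribute_name: str
    event_time: str
    value: float

    def to_document(self):
        """Serialise the entity into a plain dictionary (a JSON-like document)."""
        return asdict(self)

    @classmethod
    def from_document(cls, document):
        """Rebuild the entity from a stored document."""
        return cls(**document)


class InMemoryDal:
    """Stand-in for the data access layer; a dict replaces the real back end."""

    def __init__(self):
        self._store = {}

    def save(self, key, entity):
        self._store[key] = entity.to_document()

    def get(self, key):
        document = self._store.get(key)
        return None if document is None else AttributeEvent.from_document(document)


dal = InMemoryDal()
event = AttributeEvent("Latitude", "1976/02/15 21:23:22.6", -28.39)
dal.save("t1-latitude", event)
print(dal.get("t1-latitude"))
```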
Tests
- Dolci et al., AMICA at Dome C: results from the first year of automatic operation tests in Antarctica
- Global Centroid-Moment-Tensor (CMT) Project, www.globalcmt.org
Kibana: time series
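Archiving data for the CMT project
Event | Attribute Name | Value double | Value string | Value Date
t1 | Latitude | -28.39 | - | -
t1 | Longitude | -176.79 | - | -
t1 | Location | - | “KERMADEC ISLANDS REGION” | -
t1 | Date | - | - | 1976/02/15 21:23:22.6
t2 | Latitude | -14.74 | - | -
t2 | Longitude | 167.10 | - | -
t2 | Location | - | “VANUATU ISLANDS” | -
t2 | Date | - | - | 1976/03/04 02:50:00.5
… | | | |
tN | | | |
How do we plot them?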
Questions
- How do we archive structured data? JSON? Array?
- How do we aggregate unstructured data?
Elasticsearch and relations: four possibilities to bridge the gap
- Application-side join: there are no relations in the data, so the only possibility is to run more than one query to filter and emulate a join
- Data denormalization: read performance is increased by adding redundant copies of the data, at a cost in terms of concurrency and index size
- Nested objects: a document can be related to a nested document that is indexed together with it; a number of special operators deal with these objects
- Parent-child relationship: a document can be the parent of another one, and a child can have only one parent; the documents are completely separate
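The first option, the application-side join, can be sketched as below. Two Python lists stand in for two independent indices, and each helper function stands in for one query; the identifiers and field names are hypothetical, while the values come from the CMT example.

```python
# Two independent "indices", simulated with lists of flat documents.
attributes = [
    {"attribute_id": 1, "name": "Latitude"},
    {"attribute_id": 2, "name": "Longitude"},
]
events = [
    {"event": "t1", "attribute_id": 1, "value": -28.39},
    {"event": "t1", "attribute_id": 2, "value": -176.79},
    {"event": "t2", "attribute_id": 1, "value": -14.74},
    {"event": "t2", "attribute_id": 2, "value": 167.10},
]


def find_attribute_ids(name):
    """First query: resolve an attribute name to its identifiers."""
    return {a["attribute_id"] for a in attributes if a["name"] == name}


def find_events(attribute_ids):
    """Second query: fetch the events that reference those identifiers."""
    return [e for e in events if e["attribute_id"] in attribute_ids]


# The join happens in the application, by chaining the two queries.
latitude_events = find_events(find_attribute_ids("Latitude"))
print(latitude_events)
```

Against a real cluster each helper would be a separate round trip, which is the cost that denormalization and nested objects try to avoid.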
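Transformation into a (usable) table
Source table, one row per event and attribute:
Event | Attribute Name | Value double | Value string | ...
t1 | a1 | | |
t1 | … | | |
t1 | aM | | |
t2 | … | | |
tN | … | | |
Target table, one row per event, with dimension (M+1) x N:
Event | a1 | a2 | … | aM
t1 | Value double or Value string | ... | ... | ...
t2 | ... | | |
… | | | |
tN | ... | | |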
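Before:
Event | Attribute Name | Value double | Value string | Value Date
t1 | Latitude | -28.39 | - | -
t1 | Longitude | -176.79 | - | -
t1 | Location | - | “KERMADEC ISLANDS REGION” | -
t1 | Date | - | - | 1976/02/15 21:23:22.6
t2 | Latitude | -14.74 | - | -
t2 | Longitude | 167.10 | - | -
t2 | Location | - | “VANUATU ISLANDS” | -
t2 | Date | - | - | 1976/03/04 02:50:00.5
… | | | |
tN | | | |
After:
Event | Latitude | Longitude | Location | Date
t1 | -28.39 | -176.79 | “KERMADEC ISLANDS REGION” | 1976/02/15 21:23:22.6
t2 | -14.74 | 167.10 | “VANUATU ISLANDS” | 1976/03/04 02:50:00.5
… | | | |
tN | ... | | |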
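The transformation from one row per event and attribute to one row per event can be sketched as a simple pivot. This is a minimal illustration in Python using the values from the slides, not the code used in HDB@ELK; the tuple layout `(event, attribute, value)` is an assumption made for brevity.

```python
# Long format: one (event, attribute name, value) tuple per archived value.
long_rows = [
    ("t1", "Latitude", -28.39),
    ("t1", "Longitude", -176.79),
    ("t1", "Location", "KERMADEC ISLANDS REGION"),
    ("t1", "Date", "1976/02/15 21:23:22.6"),
    ("t2", "Latitude", -14.74),
    ("t2", "Longitude", 167.10),
    ("t2", "Location", "VANUATU ISLANDS"),
    ("t2", "Date", "1976/03/04 02:50:00.5"),
]


def pivot(rows):
    """Collect all attributes of the same event into a single wide row."""
    wide = {}
    for event, attribute, value in rows:
        # Create the row for this event on first sight, then add the attribute.
        wide.setdefault(event, {"Event": event})[attribute] = value
    return list(wide.values())


wide_rows = pivot(long_rows)
for row in wide_rows:
    print(row)
```

Each resulting row is a self-contained record, which is the shape a flat Elasticsearch document needs.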
Kibana: Geopoint
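Kibana's map visualisations work on fields mapped as `geo_point`, so the separate latitude and longitude values have to be combined into one such field. The sketch below shows one way to do this, assuming a hypothetical field name `location_point` and the typeless mapping syntax of recent Elasticsearch versions; it is not taken from the HDB@ELK code.

```python
def add_geo_point(row):
    """Return a copy of the row with latitude and longitude merged into one field."""
    document = dict(row)
    # Elasticsearch accepts a geo_point as an object with "lat" and "lon" keys.
    document["location_point"] = {"lat": row["Latitude"], "lon": row["Longitude"]}
    return document


# Index mapping that declares the hypothetical field as a geo_point.
geo_mapping = {
    "mappings": {
        "properties": {
            "location_point": {"type": "geo_point"},
        }
    }
}

event = add_geo_point({
    "Event": "t1",
    "Latitude": -28.39,
    "Longitude": -176.79,
    "Location": "KERMADEC ISLANDS REGION",
})
print(event["location_point"])
```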
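Transformation - general
[Diagram: the source table (Event, Attribute Name, Value double, Value string, ...) with events t1 ... tN and attributes a1 ... aM is transformed into a three-dimensional structure of dimension (M+1) x N x Z, with the groups g1, g2, ..., gZ along the third axis.]
The dimension depends on the grouping one wants to do (for instance, time-device-attribute).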
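The general case can be sketched as a grouping along an arbitrary list of keys, for instance time, device and attribute. The function below is an illustrative assumption, not the HDB@ELK implementation, and the rows use placeholder names and made-up values.

```python
def group_rows(rows, key_fields):
    """Nest rows into dictionaries, one nesting level per field in key_fields."""
    root = {}
    for row in rows:
        node = root
        # Walk (and create) one level per grouping key except the last.
        for field in key_fields[:-1]:
            node = node.setdefault(row[field], {})
        # The last key addresses the stored value itself.
        node[row[key_fields[-1]]] = row["value"]
    return root


# Placeholder rows: times t1, t2; devices g1, g2; attributes a1, a2.
rows = [
    {"time": "t1", "device": "g1", "attribute": "a1", "value": 1.0},
    {"time": "t1", "device": "g1", "attribute": "a2", "value": 2.0},
    {"time": "t1", "device": "g2", "attribute": "a1", "value": 3.0},
    {"time": "t2", "device": "g1", "attribute": "a1", "value": 4.0},
]

cube = group_rows(rows, ("time", "device", "attribute"))
print(cube)
```

Changing the order or number of entries in `key_fields` changes the shape of the result, which mirrors the remark that the dimension depends on the chosen grouping.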
Conclusion
- A JSON device attribute was introduced wherever aggregation was needed (nested-object relations); this required changes in the source code so that the additional type could be archived
- The system appears to be designed for archiving time series only
- New developments in the TANGO core model could help reduce the aggregation trade-off
- A specific JSON data type and scheduled custom archiving scripts could be beneficial too
Thank you for your attention
For any questions, you can write to: Matteo Di Carlo, matteo.dicarlo@inaf.it