Panel: Data Intensive Science
HPCS 2015 – The International Conference on High Performance Computing & Simulation, 22 July 2015
Marian Bubak
AGH University of Science and Technology, Krakow, Poland, and University of Amsterdam, Amsterdam, The Netherlands
DICE Team
Affiliations:
– Academic Computer Centre CYFRONET AGH (1973): 120 employees
– Department of Computer Science AGH (1980): 800 students, 70 employees
– Faculty of Computer Science, Electronics and Telecommunication (2012): 2000 students, 200 employees
– AGH University of Science and Technology (1919): 16 faculties, students; 4000 employees
– Other 15 faculties
Distributed Computing Environments (DICE) Team:
– Investigation of methods for building complex scientific collaborative applications
– Elaboration of environments and tools for e-Science
– Integration of large-scale distributed computing infrastructures
– Knowledge-based approach to services, components, and their semantic composition
From the Workshop on Cloud Services for File Synchronisation and Sharing, CERN, Nov 17-18, 2014
– Protocols for file sharing and synchronization
– Reliability and consistency of file synchronization services
– Efficiency and scalability of file synchronization services
– File-sharing semantics
– Data analysis workflows
– Backend storage technologies
– Federated access to cloud storage
– Integration of large data repositories
– Mobile access to data
In service orchestration, all data is passed to the workflow engine
Data transfers are made through SOAP, which is unfit for large data transfers
– Storage federation
– Scalable data access
Spiros Koulouzis, Reggie Cushing, Kostas Karasavvas, Adam Belloum, and Marian Bubak. Enabling web services to consume and produce large datasets. IEEE Internet Computing, 16(1):52–60, 2012
Spiros Koulouzis, Dmitry Vasyunin, Reginald Cushing, Adam Belloum, and Marian Bubak. Cloud data federation for scientific applications. In Euro-Par 2013: Parallel Processing Workshops, LNCS 8374, pp 13–22. Springer, 2014
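The orchestration bottleneck named on this slide can be illustrated with a minimal sketch in which the workflow engine relays only a reference to the data, while the services stream the dataset to and from shared storage themselves. This is not the mechanism of the cited papers; the function names, the file-based storage, and the chunk size are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # bytes moved per step; an arbitrary illustrative value


def producer_service(storage_dir: Path, n_bytes: int) -> str:
    """Hypothetical producer: writes its output to shared storage in chunks
    and returns only a reference (here, a file path)."""
    target = storage_dir / "dataset.bin"
    remaining = n_bytes
    with open(target, "wb") as out:
        while remaining > 0:
            step = min(CHUNK_SIZE, remaining)
            out.write(b"\x00" * step)
            remaining -= step
    return str(target)


def consumer_service(reference: str) -> tuple[int, str]:
    """Hypothetical consumer: streams the dataset straight from storage and
    returns its size and SHA-256 digest."""
    digest = hashlib.sha256()
    size = 0
    with open(reference, "rb") as src:
        while chunk := src.read(CHUNK_SIZE):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def workflow_engine(storage_dir: Path, n_bytes: int) -> tuple[int, int]:
    """The engine handles the reference only, never the payload.
    Returns (bytes relayed by the engine, bytes processed by the consumer)."""
    reference = producer_service(storage_dir, n_bytes)
    size, _ = consumer_service(reference)
    return len(reference.encode()), size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        relayed, processed = workflow_engine(Path(tmp), 5_000_000)
        print(f"engine relayed {relayed} bytes, consumer processed {processed} bytes")
```

In this sketch the message passing through the engine stays a few dozen bytes long regardless of the dataset size, which is the property that a SOAP envelope carrying the full payload lacks.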
Cloud and Big Data Benchmarking and Verification Methodology
Methodology of evaluation of systems and applications
– Qualitative metrics (architectures, functionality)
– Quantitative metrics (performance, stability, cost)
– Test scenarios, test cases and parameters
– Experiment planning, analysis of results
Selection of benchmarks
– Portfolio of standard benchmarks
– Design of application-specific scenarios
Target platforms
– IaaS clouds (public, private)
– Hybrid clouds with cloud bursting
– Real-time Big Data processing systems (Hadoop, Spark, ElasticSearch)
Collaboration with Samsung R&D Polska
– Methodology applied to cloud infrastructure at the industrial partner
– Consultancy on the analysis of results and development of a Testing-as-a-Service (TaaS) system
K. Zieliński, M. Malawski, M. Jarząb, S. Zieliński, K. Grzegorczyk, T. Szepieniec, and M. Zyśk: Evaluation Methodology of Converged Cloud Environments. In: K. Wiatr, J. Kitowski, M. Bubak (Eds.): Proceedings of the Seventh ACC Cyfronet AGH Users’ Conference, ACC CYFRONET AGH, Kraków, 2014
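The three quantitative metrics listed on this slide (performance, stability, cost) could be derived from repeated timed runs roughly as follows. This is a sketch only, not the methodology of the cited paper: the choice of mean run time for performance, coefficient of variation for stability, and a flat hourly price for cost are all assumptions.

```python
import statistics
import time
from typing import Callable


def summarize(durations: list[float], hourly_price: float) -> dict:
    """Turn raw run times (seconds) into three illustrative metrics."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return {
        "mean_s": mean,                               # performance
        "cv": stdev / mean if mean > 0 else 0.0,      # stability (lower is steadier)
        "cost_per_run": hourly_price * mean / 3600.0  # cost, assuming hourly billing
    }


def run_benchmark(workload: Callable[[], None], repetitions: int,
                  hourly_price: float) -> dict:
    """Execute a test case several times and summarize the measurements."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return summarize(durations, hourly_price)


if __name__ == "__main__":
    # Hypothetical test case: sorting one million integers on a 0.10/hour instance.
    report = run_benchmark(lambda: sorted(range(1_000_000, 0, -1)), 5, 0.10)
    print(report)
```

Keeping `summarize` separate from the timing loop lets the same analysis be applied to measurements collected elsewhere, for example on the target cloud platform.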
Data security in clouds
To ensure security of data in transit:
– Modern applications use secure transport protocols (e.g. TLS)
– For legacy unencrypted protocols, if absolutely needed, or as an additional security measure, a VPN can be used:
  – Site-to-site VPN, e.g. between cloud sites, set up outside of the instance
  – Remote access VPN, for individual users connecting e.g. from their laptops
Data should be stored securely and reliably deleted when no longer needed
– Clouds are not secure enough: storage optimisations make it hard to ensure that data have actually been deleted
A solution:
– End-to-end encryption (the decryption key stays in a protected/private zone)
– Data dispersal (the data is split into portions dispersed between nodes, so that recovering the whole message is non-trivial or impossible)
J. Meizner, M. Bubak, M. Malawski, P. Nowakowski: Secure Storage and Processing of Confidential Data on Public Clouds. In: PPAM 2013, LNCS 8384, Springer, 2014
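The data dispersal idea on this slide can be illustrated with a minimal XOR-based splitting sketch. It is not necessarily the scheme used in the cited paper: here every node receives a share as long as the message, all shares are required for recovery, and there is no redundancy, so losing one share loses the data. The function names `disperse` and `recover` are hypothetical.

```python
import secrets


def xor_bytes(left: bytes, right: bytes) -> bytes:
    """Byte-wise XOR of two equally long byte strings."""
    return bytes(a ^ b for a, b in zip(left, right))


def disperse(message: bytes, n_nodes: int) -> list[bytes]:
    """Split a message into n_nodes shares, one per storage node.
    Any subset smaller than n_nodes is indistinguishable from random bytes."""
    if n_nodes < 2:
        raise ValueError("dispersal needs at least two nodes")
    # n_nodes - 1 shares are pure randomness from the OS CSPRNG.
    shares = [secrets.token_bytes(len(message)) for _ in range(n_nodes - 1)]
    # The last share is the message XORed with all random shares.
    last = message
    for share in shares:
        last = xor_bytes(last, share)
    shares.append(last)
    return shares


def recover(shares: list[bytes]) -> bytes:
    """XOR all shares together to obtain the original message."""
    result = bytes(len(shares[0]))  # all-zero buffer of the share length
    for share in shares:
        result = xor_bytes(result, share)
    return result


if __name__ == "__main__":
    parts = disperse(b"confidential record", 3)
    print(recover(parts))
```

Schemes with redundancy (where only k of n shares are needed) exist as well; the sketch above deliberately shows only the simplest all-or-nothing variant.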
Collaborative metadata management
Competences
– Exploitation of PaaS-based solutions with in-house installations
– Handling heterogeneous data in diverse scientific disciplines
– Building multi-layer and multi-protocol software stacks
Objectives
– Ad-hoc metadata model creation and deployment of corresponding storage facilities
– Create a research space for metadata model exchange and discovery, with associated data repositories and access restrictions in place
– Support for different types of storage sites and data transfer protocols
Architecture
– Web interface-based metadata model management
– PaaS-based repositories over REST
– Site-specific storage infrastructure for file persistence
D. Harężlak, M. Kasztelnik, M. Pawlik, B. Wilk, and M. Bubak: A Lightweight Method of Metadata and Data Management with DataNet. In: M. Bubak, J. Kitowski, K. Wiatr (Eds.): eScience on Distributed Computing Infrastructure, LNCS, Springer, 2014
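What an ad-hoc metadata model might look like can be sketched as a mapping from field names to types, with records checked against it before they reach a repository. This is an illustration only and does not reflect DataNet's actual model format or API; the model, its field names, and the `validate` helper are hypothetical.

```python
from typing import Any

# Hypothetical metadata model: field name -> expected Python type.
# The "data_file" field holds a reference to a file kept in site-specific storage.
EXAMPLE_MODEL: dict[str, type] = {
    "sample_id": str,
    "temperature_k": float,
    "data_file": str,
}


def validate(record: dict[str, Any], model: dict[str, type]) -> list[str]:
    """Return a list of problems found in the record; an empty list means it fits the model."""
    errors = []
    for field, expected in model.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in record:
        if field not in model:
            errors.append(f"unknown field: {field}")
    return errors


if __name__ == "__main__":
    good = {"sample_id": "s-001", "temperature_k": 293.15, "data_file": "file-reference-001"}
    print(validate(good, EXAMPLE_MODEL))
    print(validate({"sample_id": 7, "temperature_k": 293.15}, EXAMPLE_MODEL))
```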
Levee Monitoring Application (ISMOP project)
– Levee breach threat due to a passing wave
– High water levels lasting for up to 2 weeks
– Large areas of levees affected (100+ km)
Flood threat assessment platform
Bartosz Balis, Marek Kasztelnik, Maciej Malawski, Piotr Nowakowski, Bartosz Wilk, Maciej Pawlik, Marian Bubak: Execution Management and Efficient Resource Provisioning for Flood Decision Support. ICCS 2015, Procedia Computer Science 51, Elsevier, 2015
Collage - executable e-Science publications
Goal: Extending the traditional scientific publishing model with computational access and interactivity mechanisms; enabling readers (including reviewers) to replicate and verify experimentation results and browse large-scale result spaces.
Challenges:
– Scientific: A common description schema for primary data (experimental data, algorithms, software, workflows, scripts) as part of publications; deployment mechanisms for on-demand reenactment of experiments in e-Science.
– Technological: An integrated architecture for storing, annotating, publishing, referencing and reusing primary data sources.
– Organizational: Provisioning of executable paper services to a large community of users representing various branches of computational science; fostering further uptake through involvement of major players in the field of scientific publishing.
P. Nowakowski, E. Ciepiela, D. Harężlak, J. Kocot, M. Kasztelnik, T. Bartyński, J. Meizner, G. Dyk, M. Malawski: The Collage Authoring Environment. In: Proceedings of the International Conference on Computational Science, ICCS 2011 (2011). Winner of the Elsevier/ICCS Executable Paper Grand Challenge
E. Ciepiela, D. Harężlak, M. Kasztelnik, J. Meizner, G. Dyk, P. Nowakowski, M. Bubak: The Collage Authoring Environment: From Proof-of-Concept Prototype to Pilot Service. In: Procedia Computer Science, vol. 18, 2013
Simulating a city, citizen science
[Figure labels: Sensors, Simulating, Open Data, Data Analytics, Decision]
Understanding a city (mobility, crime, flood, health, evacuation, etc.) through computation
A set of simulations combined together and reacting to changes
Key challenges:
– Open data ( - Tomek Gubała’s initiative)
– Distributed environment with auto-scaling capability (e.g. Atmosphere, AWS Auto Scaling)
– Simulation repository
– Decision Support System
Proof-of-concept projects which use Open Data (work in progress): https://plankrk.herokuapp.com
Automata-based dynamic data processing
A data processing schema can be considered as a state transformation graph
The graph facilitates data processing in many ways:
– Data state can be easily tracked
– Using the graph as a protocol header, a virtual data processing network layer is achieved
– Data becomes self-routable to processing nodes
– Collaboration can be achieved by joining the virtual network
Figure: State graph describing a filtering state machine for tweets, which is mapped to 11 VMs
Reginald Cushing, Adam Belloum, Marian Bubak, and Cees de Laat. Automata-based dynamic data processing for clouds. In Euro-Par 2014: Parallel Processing Workshops, LNCS 8805, pp 93–104, 2014
Reginald Cushing, Adam Belloum, Marian Bubak, and Cees de Laat. Towards Computing Without Borders: Data Processing Plane. In review: Future Generation Computer Systems, 2015
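The idea of a state transformation graph that makes data self-routable can be sketched in a few lines. This is a single-process illustration, not the system from the cited papers: the states, the three filtering functions, and the keyword are invented, and the mapping of graph nodes to VMs is left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataItem:
    """A piece of data that carries its own processing state."""
    payload: str
    state: str = "raw"
    history: list[str] = field(default_factory=list)  # tracked data states


# Each processing node transforms the item and returns the next state.
def normalize(item: DataItem) -> str:
    item.payload = item.payload.strip().lower()
    return "normalized"


def language_filter(item: DataItem) -> str:
    # Crude stand-in for language detection: ASCII-only text passes.
    return "english" if item.payload.isascii() else "rejected"


def keyword_filter(item: DataItem) -> str:
    return "accepted" if "flood" in item.payload else "rejected"


# Hypothetical state graph: current state -> processing node.
STATE_GRAPH: dict[str, Callable[[DataItem], str]] = {
    "raw": normalize,
    "normalized": language_filter,
    "english": keyword_filter,
}
TERMINAL_STATES = {"accepted", "rejected"}


def route(item: DataItem,
          graph: dict[str, Callable[[DataItem], str]] = STATE_GRAPH) -> DataItem:
    """Let the item's current state select the next processing node until a terminal state."""
    while item.state not in TERMINAL_STATES:
        item.history.append(item.state)
        item.state = graph[item.state](item)
    item.history.append(item.state)
    return item


if __name__ == "__main__":
    tweet = route(DataItem("  Flood warning in Krakow "))
    print(tweet.state, tweet.history)
```

Because the item's state alone determines its next hop, a distributed variant could place each node of `STATE_GRAPH` on a different machine and forward items by state, which is the sense in which the slide calls the data self-routable.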