Title:
|
A directive-based approach to perform persistent checkpoint/restart
|
Author:
|
Maroñas, Marcos; Mateo, Sergi; Beltran Querol, Vicenç; Ayguadé Parra, Eduard
|
Other authors:
|
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
Abstract:
|
Exascale platforms require support for resilience capabilities because of their increasing numbers of components and the associated error rates. In this paper, we present a new directive-based approach to perform application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP, that allows users to easily specify the application state that has to be saved and restored. This leaves the tedious and error-prone serialization and deserialization activities to our library, which relies on SCR/FTI to perform scalable and efficient I/O operations. Our results, based on several benchmarks and two large applications, reveal no additional overhead compared to the direct use of the FTI and SCR checkpoint/restart libraries. In addition, our portable approach significantly improves programmability, reducing the number of code lines required to perform checkpoint/restart by an average of about 82% for FTI and about 94% for SCR. |
Funding:
|
The research leading to these results has received funding from the European Community Seventh Framework Programme (FP7/2007-2013) via the DEEP-ER project under Grant Agreement number 610476. This work has also been supported by the Spanish Ministry of Science and Innovation (contract TIN2012-34557) and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). |
Note:
|
Peer Reviewed |
Subject(s):
|
-UPC subject areas::Computer science -Fault-tolerant computing -High performance computing -Libraries -Fault tolerant systems -Checkpointing -Redundancy -Tools -Resilience -Checkpoint/restart -Resiliency -Fault tolerance -Exascale -Programmability -Programming models -Fault tolerance (Computer science) -High performance computing (Computer science) |
Document type:
|
Article - Published version; Conference Object |
Published by:
|
Institute of Electrical and Electronics Engineers (IEEE)
|