Title:
|
Designing and modelling selective replication for fault-tolerant HPC applications
|
Author:
|
Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José
|
Other authors:
|
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
Abstract:
|
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications for both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user. |
Abstract:
|
This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant
agreement no. 610402 and the FEDER funds under contract TIN2015-65316-P. |
Abstract:
|
Peer Reviewed |
Subject(s):
|
-Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors -Fault-tolerant computing -Parallel processing (Electronic computers) -Computational modeling -Computer crashes -Mathematical model -Markov processes -Hardware -Reliability theory -Tolerància als errors (Informàtica) -Processament en paral·lel (Ordinadors) |
Rights:
|
|
Document type:
|
Article - Submitted version Conference Object |
Published by:
|
Institute of Electrical and Electronics Engineers (IEEE)
|
Share:
|
|