To access the full text documents, please follow this link: http://hdl.handle.net/2117/107497

Designing and modelling selective replication for fault-tolerant HPC applications
Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications for both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.
This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and the FEDER funds under contract TIN2015-65316-P.
Peer Reviewed
-Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
-Fault-tolerant computing
-Parallel processing (Electronic computers)
-Computational modeling
-Computer crashes
-Mathematical model
-Markov processes
-Hardware
-Reliability theory
-Tolerància als errors (Informàtica)
-Processament en paral·lel (Ordinadors)
Article - Submitted version
Conference Object
Institute of Electrical and Electronics Engineers (IEEE)
         

Show full item record

Related documents

Other documents of the same author

Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José
Subasi, Omer; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Yalcin, Gulay; Cristal Kestelman, Adrián
Subasi, Omer; Martsinkevich, Tatiana; Zyulkyarov, Ferad; Unsal, Osman; Labarta Mancho, Jesús José; Cappello, Franck
Subasi, Omer; Di, Sheng; Bautista-Gomez, Leonardo; Balaprakash, Prasanna; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Cristal Kestelman, Adrián; Krishnamoorthy, Sriram; Cappello, Franck
Subasi, Omer; Di, Sheng; Bautista Gómez, Leonardo; Balaprakash, Prasanna; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Cristal Kestelman, Adrián; Cappello, Franck
 

Coordination

 

Supporters