Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice

John A. Gunnels
Department of Computer Sciences
The University of Texas at Austin
Taylor Hall 2.124
Austin, TX 78712
gunnels@cs.utexas.edu
 
Daniel S. Katz
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, CA 91109-8099
Daniel.S.Katz@jpl.nasa.gov
 
 
Enrique S. Quintana-Ortí
Dept. de Informática
Universidad Jaume I
12080 Castellón
Spain
quintana@inf.uji.es
 
Robert A. van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Taylor Hall 2.124
Austin, TX 78712
rvdg@cs.utexas.edu
 

In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = A B , in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B . Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.